arXiv:1506.02084v1 [math.ST] 5 Jun 2015

Exact P-values for Network Interference*

Susan Athey^ Dean Eckles^ Guido W. Imbens^

June 2015


We study the calculation of exact p-values for a large class of non-sharp null hypotheses 
about treatment effects in a setting with data from experiments involving members of a 
single connected network. The class includes null hypotheses that limit the effect of one 
unit’s treatment status on another according to the distance between units; for example, the 
hypothesis might specify that the treatment status of immediate neighbors has no effect, or 
that units more than two edges away have no effect. We also consider hypotheses concerning 
the validity of sparsification of a network (for example based on the strength of ties) and 
hypotheses restricting heterogeneity in peer effects (so that, for example, only the number 
or fraction treated among neighboring units matters). Our general approach is to define 
an artificial experiment, such that the null hypothesis that was not sharp for the original 
experiment is sharp for the artificial experiment, and such that the randomization analysis 
for the artificial experiment is validated by the design of the original experiment. 

JEL Classification: C14, C21, C52 

Keywords: Randomization Inference, Interactions, Fisher Exact P-values, SUTVA, Spillovers 

*We are grateful for comments by Peter Aronow, Peter Bickel, Bryan Graham, Brian Karrer, Johan 
Ugander, and seminar and conference participants at Cornell, the California Econometrics Conference, 
the UC Davis Institute for Social Science Inaugural Conference, and the Network Reading group at 

^Graduate School of Business, Stanford University and NBER, 


^Graduate School of Business, Stanford University and NBER, 

1 Introduction 

This paper studies the calculation of exact p-values for a large class of non-sharp null hypotheses 
about treatment effects in a setting with data from experiments involving members of a single 
connected network. For example, researchers might randomly assign some members, or clusters 
of members, of a social network to a treatment such as receiving information. We consider an 
environment where the following are observed: (i) the vector of treatments for all individuals in
the network; (ii) the realized outcomes for all of the individuals (or possibly, only a subset); (iii)
all of the edges connecting individuals (where edges may potentially be categorized, for example 
into strong or weak edges); (iv) possibly, fixed characteristics for these individuals. Because 
the data come from a single network, with all units potentially connected and thus all units 
potentially affected by the full vector of treatments, establishing large sample approximations 
to distributions of statistics is challenging. This motivates our focus on the calculation of exact 
p-values based on the randomization distribution (Fisher, 1925). The validity of the p-value 
calculations does not depend on the network structure or the sample size. Although we focus 
on the case of an explicit network where edges at most belong to a small number of categories, 
the general methods we develop can be applied to more general settings with “interference” and 
some measure of distance between units, where the researcher is interested in testing hypotheses 
about the nature of interference and how it relates to distance. 

This paper considers a wide class of hypotheses about interference, sometimes caused by 
social interactions between units, where three categories of null hypotheses serve as leading 
examples. In all three, the hypothesis restricts the effects of the treatment of other units on a 
particular unit, while allowing for an individual’s own treatment status to have a direct effect. 
The first category specifies that the treatment status of units with network distance weakly
greater than k does not matter; when k = 1, this requires that no other units’ treatments have an
impact, when k = 2 only immediate neighbors’ treatments matter, while when k = 3 only neighbors
as well as neighbors of neighbors matter. These types of hypotheses have been considered in
empirical applications — Bond et al. (2012) claim to find that “messages not only influenced 
the users who received them, but also the users’ friends, and friends of friends” (p. 295) — 
as well as in theoretical work, with many models a priori constraining spillovers in networks 
by ruling out effects of friends of friends (e.g., Toulis and Kao, 2013). The second category of 
hypotheses concerns the comparison between different categories of edges: e.g., under the null, 
only the treatments of neighbors with edges in one category matter. For example, Goldenberg, 
Zheng, Fienberg and Airoldi (2009) discuss a network defined through email interactions between
Enron employees, with edges defined by the volume of email correspondence exceeding a
threshold. Similarly, in analyses of large social networks researchers often sparsify the network 
by trimming edges between individuals with few interactions (see Thomas and Blitzstein, 2011, 
Bond et al., 2012, and Eckles, Karrer, and Ugander, 2014). One can test whether such sparsification
is appropriate by testing the hypothesis that there are no spillovers between individuals
not connected according to one definition of edges, but who would be connected under a looser 
definition of edges. The third category of hypotheses concerns restrictions on heterogeneity 
in the impact of neighbors. For example, many models assume that only the number or fraction
of treated neighbors matters for an individual’s outcome, not which of their neighbors were
treated. An alternative of interest might be that neighbors with more connections matter more. 

There is a growing literature focusing on testing and inference in settings with general interference
between units, both theoretical and empirical.^1 However, there is no available general
asymptotic theory that handles hypothesis tests about these categories of null hypotheses, and
the nascent literature on estimation in network settings requires strong restrictions on the network
size and structure.^2 In contrast, our primary goal is to test hypotheses about the impact
of treatments in a network setting, without restricting the network.

The main contribution of this paper is to expand the applicability of the “randomization 
inference” approach to calculating exact p-values, originally developed by Fisher (1925) and 
Rosenbaum (1984), to our hypotheses of interest.^3 In the randomization inference approach,
the distribution of a test statistic is generated by the assignment mechanism, keeping fixed 
the potential outcomes and characteristics of the units. This approach only applies directly to 
“sharp” null hypotheses, whereby the null hypothesis allows the analyst to infer the outcomes 
of individuals under alternative (counterfactual) treatment vectors. For example, the null hypothesis
that the treatment has no effect whatsoever is sharp, because an individual’s outcome
is known (and equal to his realized outcome) under all possible treatment vectors. Given this, 
it is possible to simulate draws from the random assignment of treatment vectors, and calculate 
the test statistic of interest under each simulated draw (in this example, a natural test statistic 
is the average difference in outcomes between treated and control individuals). The distribution 
of these simulated test statistics converges to the true distribution of the test statistic as the 
number of draws grows, and this true distribution is exact for the given network size and structure
rather than a large sample approximation. Thus, exact p-values for the null hypothesis
of no treatment effects can be derived in a network setting using a conventional application of 
randomization inference. 

In contrast, the three leading categories of null hypotheses outlined above are not sharp because
under the null hypotheses we cannot infer the exact values for all outcomes for all possible
values of the treatment; since all of the categories allow the treatment to have a direct effect 
on individuals, their outcomes cannot be inferred under alternative treatment assignments. In 
this paper we present a novel approach to dealing with such non-sharp null hypotheses. Closest 

^1 See Manski (1993, 2013), Christakis and Fowler (2007), Rosenbaum (2007), Kolaczyk (2009), Aronow (2012),
Bond, Fariss, Jones, Kramer, Marlow, and Fowler (2012), Bowers, Fredrickson, and Panagopoulos (2012), Hudgens
and Halloran (2012), Ugander, Karrer, Backstrom, and Kleinberg (2012), Tchetgen and VanderWeele (2012),
Goldsmith-Pinkham and Imbens (2013), Liu and Hudgens (2013), Aronow and Samii (2014), Choi (2014), Eckles,
Karrer, and Ugander (2014), Ogburn and VanderWeele (2014) and van der Laan (2014).

^2 A small literature has emerged that posits a specific functional form model of network formation (and thus
the process for how the network changes as the size of the network grows), and then proposes an approach for
estimating the parameters of the network formation process (as opposed to parameters describing treatment
effects). In a leading example, Chandrasekhar and Jackson (2014) establish consistency and asymptotic normality
of parameter estimates for network formation under certain conditions (e.g., the network is sufficiently sparse for a
class of models they call subgraph generation models). See also Holland and Leinhardt (1981), Kolaczyk (2009),
Manski (1993, 2013), Goldsmith-Pinkham and Imbens (2013), and Aronow and Samii (2014).

^3 For applications of randomization inference outside the network setting, see Basu (1980), Rubin (1980),
Rosenbaum (1995, 2002, 2007, 2009), Lehmann and Romano (2005), Imbens and Rubin (2015), and Canay, 
Romano, and Shaikh (2015). 

in spirit to this paper, Aronow (2012) adapts the randomization inference approach to consider 
the specific non-sharp null hypothesis that only an individual’s own treatment and that of his 
immediate neighbors matter, corresponding to the first category described above with k = 1. 
Here, we provide a general framework that applies to a much larger class of non-sharp null
hypotheses.

At an abstract level we address the problem that the null hypothesis of interest is not sharp 
by introducing the notion of an artificial experiment that differs from the experiment that was 
actually carried out. This artificial experiment will be chosen so that the randomization analysis 
we propose for it is validated by the design of the experiment that was actually carried out, and 
at the same time the null hypothesis of interest that was not sharp for the actual experiment, 
is sharp for the artificial experiment. In simple settings this idea of analyzing an experiment 
that differs from the experiment that was actually carried out is often used implicitly. Suppose 
we have an experiment where for each unit in a finite population a coin is flipped to determine 
the treatment assignment for that unit. Given the data, we may analyze the data as if the total 
number of treated units is fixed, whereas in the actual experiment the number of treated units is 
random. Analyzing the experiment as if the number of treated units is fixed is valid because we 
can think of the original experiment being a sequential one where in the first stage the number 
of treated units is determined by a sequence of coin tosses, and in the second stage the fixed
number of treated units is selected from the population at random. The artificial experiment is
now simply the second stage of the original experiment, conditional on the first stage. In this
case there is no loss of information because the number of treated units is ancillary.^4
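
The two-stage reading of the coin-flip design can be checked by brute force on a small population. The sketch below is only an illustration: the population size, the treatment probability of 3/10, and the function names are assumptions made for the example, not taken from the paper. It enumerates all assignments under independent coin flips and conditions on the number of treated units, which leaves every assignment with that count equally likely.

```python
from fractions import Fraction
from itertools import product

def bernoulli_design(n, prob):
    """Return {assignment: probability} for n independent coin flips."""
    design = {}
    for w in product((0, 1), repeat=n):
        p = Fraction(1)
        for w_i in w:
            p *= prob if w_i == 1 else 1 - prob
        design[w] = p
    return design

def conditional_on_count(design, n_treated):
    """Second-stage design: condition on the total number of treated units."""
    subset = {w: p for w, p in design.items() if sum(w) == n_treated}
    total = sum(subset.values())
    return {w: p / total for w, p in subset.items()}

# Hypothetical example: 4 units, each treated with probability 3/10.
design = bernoulli_design(4, Fraction(3, 10))
second_stage = conditional_on_count(design, 2)
for w, p in second_stage.items():
    print(w, p)
```

Exact rational arithmetic is used so that the equality of the conditional probabilities can be verified without rounding error.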

In the settings we analyze in the current paper we also decompose the original experiment 
into two stages, and we analyze the experiment performed in the second stage conditional on 
the first stage randomization. In an additional modification to the original experiment, we 
focus on a limited set of test statistics, namely those that depend on outcomes only for a 
subset of the original population, which we call the “focal units.” These changes to the original 
experiment lead to an artificial experiment where the null hypothesis that is not sharp in the 
original experiment is sharp for the artificial experiment, and where randomization inference 
is validated by the original experiment. The choice of the focal units on whose outcomes the 
statistic may depend and the decomposition of the original experiment into two stages are 
intricately linked to achieve the goal of defining an experiment with a sharp null hypothesis 
amenable to randomization inference. 

The choice of focal units will matter for the power of the tests, but for any choice of focal 
units our approach will lead to exact p-values. Given the choice of focal units, we derive 
the unique partition of the space of assignments into subsets such that the null hypothesis 
implies that the outcomes for all focal units must be constant within these subsets. Then the 
original experiment is re-interpreted as a sequential experiment where in the first stage the 
subset into which the assignment falls is determined, and in the second stage the assignment is 
drawn randomly from within the subset (with the likelihood of each assignment implied by the 
original experiment). The analysis of our artificial experiment then focuses on a test statistic 

^4 This is similar in spirit to Rosenbaum (1984), who carries out randomization tests conditional on covariates
or functions thereof such as the propensity score. 


constructed from outcomes for the subpopulation of focal units and relies on the second stage 
randomization, conditional on the randomization in the first stage, to construct the p-value for 
the test statistic. 

With our framework for testing in place, it is then possible to compare the statistical power 
of alternative test statistics. We do this for our three categories of hypotheses, and we propose 
statistics that will be optimal for particular models of network interactions. This in turn lays 
the groundwork for future research about optimal experimental design when the goal is to test 
a given hypothesis or set of hypotheses. 

The remainder of the paper is organized as follows. In the next section we introduce the 
general set up and notation. In Section 3 we discuss a number of the hypotheses that we
consider. This is not an exhaustive list, but it contains what we view as leading examples
of the hypotheses researchers may wish to consider in network settings. Section 4 contains a
general discussion of the notion of artificial experiments that lies at the heart of our approach.
In the next four sections, Sections 5 through 8, we discuss in detail how the approach would
be implemented for the main categories of null hypotheses we consider. These details include
discussions of the decisions researchers need to make regarding the choice of focal units and test
statistics. In Section 9 we present the results from some simulations to evaluate the statistical
power of the tests for alternative statistics. Section 10 concludes.

2 Set Up 

We have information on a population $\mathbb{P}$ of $N$ individuals, with $i$ indexing the individuals. We
also have a set of treatments $\mathbb{W}$. In most of our examples each individual is either exposed
to an intervention or not, although that is not necessary for some of the results. In that case
for individual $i$ the exposure is denoted by $W_i \in \{0,1\}$, with $\mathbf{W}$ the $N$-component vector of
exposures with $i$th component equal to $W_i$, and $\mathbb{W} = \{0,1\}^N$. There is a mapping
$\mathbf{Y} : \mathbb{W} \mapsto \mathbb{Y}^N$ of potential outcomes, with the $i$th element of this mapping written as
$Y_i : \mathbb{W} \mapsto \mathbb{Y}$, where $\mathbb{Y} \subset \mathbb{R}$ is the set of values for the potential outcomes. We refer to
$Y_i(\mathbf{w})$ as a potential outcome, with the corresponding vector of potential outcomes denoted by
$\mathbf{Y}(\mathbf{w})$. For the realized value of the assignment $\mathbf{W}$ we observe the corresponding vector of
potential outcomes,

$$\mathbf{Y}^{\mathrm{obs}} = \mathbf{Y}(\mathbf{W}).$$

The treatment exposure $\mathbf{W}$ is assigned through an assignment mechanism $p : \mathbb{W} \mapsto [0,1]$,
where $p(\mathbf{w})$ is the probability of the assignment $\mathbf{W}$ taking on the value $\mathbf{w}$, $p(\mathbf{w}) = \mathrm{pr}(\mathbf{W} = \mathbf{w})$,
satisfying $p(\mathbf{w}) \geq 0$ and $\sum_{\mathbf{w} \in \mathbb{W}} p(\mathbf{w}) = 1$.

The units are connected through an undirected network that is observed by the researcher.
The symmetric $N \times N$ adjacency matrix $G$ measures the network, with the $(i,j)$th element
of the adjacency matrix, denoted by $G_{ij}$, equal to one if there is an edge between units $i$ and
$j$, and zero otherwise. By convention all diagonal elements $G_{ii}$ are equal to zero. We will call
individuals $i$ and $j$ neighbors or peers if $G_{ij} = 1$. The network is taken here to be a fixed
characteristic of the population. Let the distance $d(i,j)$ between units $i$ and $j$ be the length of
the shortest path between $i$ and $j$, and equal to $\infty$ if there is no path between $i$ and $j$. Thus
$d(i,i) = 0$, $d(i,j) = 1$ if $i \neq j$ and $G_{ij} = 1$, $d(i,j) = 2$ if $i \neq j$ and $G_{ij} = 0$ but there is
at least one unit $k$ such that $G_{ik} = 1$ and $G_{kj} = 1$, et cetera. A special case is that with
non-overlapping peer groups, considered, for example, in Manski (1993, 2013), Hudgens and
Halloran (2008), and Carrell, Sacerdote and West (2013), where for all triples $(i,j,k)$, $G_{ij} = 1$
and $G_{jk} = 1$ implies $G_{ik} = 1$. We allow for such settings, but do not require them. Let $\mathbb{G}$ be
the space of possible adjacency matrices.
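
The distance just defined can be computed by breadth-first search from each unit. The following is a minimal sketch, assuming the adjacency matrix is stored as a list of 0/1 lists; the function name `network_distances` and the example network are illustrative choices, not part of the paper.

```python
import math
from collections import deque

def network_distances(G):
    """All-pairs shortest-path lengths for a symmetric 0/1 adjacency matrix G.

    Pairs with no connecting path get distance math.inf.
    """
    n = len(G)
    dist = [[math.inf] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            i = queue.popleft()
            for j in range(n):
                # Visit each unit the first time it is reached from the source.
                if G[i][j] == 1 and dist[source][j] == math.inf:
                    dist[source][j] = dist[source][i] + 1
                    queue.append(j)
    return dist

# Hypothetical network: a path 0 - 1 - 2 plus an isolated unit 3.
G = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
d = network_distances(G)
print(d[0])
```

In this example unit 2 is a neighbor-of-neighbor of unit 0, and unit 3 is at infinite distance from every other unit.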

For each individual there is also a vector of attributes $X_i$, with the matrix of attributes denoted
by $X$. Both the network and the attributes are viewed as pretreatment variables, not affected
by the treatment. We focus on the case where we observe the quadruple $(\mathbf{Y}^{\mathrm{obs}}, \mathbf{W}, G, X)$.
More generally we may observe outcomes for a subset of the population. The first two components
of this quadruple, $\mathbf{Y}^{\mathrm{obs}}$ and $\mathbf{W}$, are random because of the randomization; the last two,
$G$ and $X$, as well as the potential outcome function $\mathbf{Y}(\cdot)$, are taken as fixed.

Let us think of an experiment for causal effects, denoted by $\mathcal{E}$, being defined by a combination
of the set $\mathbb{W}$ of possible values for the treatment $\mathbf{W}$; the population $\mathbb{P}$ of units characterized
by their potential outcomes, their network and their fixed attributes; and a distribution for the
treatment assignment, $p : \mathbb{W} \mapsto [0,1]$, so that $\mathcal{E} = (\mathbb{W}, \mathbb{P}, p(\cdot))$.

3 Hypotheses 

In this section we discuss the three general classes of null hypotheses we consider, as well as some 
specific examples, and briefly discuss how p-values are calculated given a sharp null hypothesis.
The classes of hypotheses are not exhaustive, but they include many of the hypotheses that
we view as interesting in settings with networks and are suggestive of the generality of the
approach.
3.1 Some General Concepts 

Let us start by formally defining several concepts: (i) a null hypothesis on treatment effects; 
(ii) whether a null hypothesis on treatment effects is sharp; (iii) level sets, that is, sets of 
assignments that result in invariant outcomes for a given individual. 

Definition 1 (A Null Hypothesis on Treatment Effects) A null hypothesis on treatment
effects $H_0$ is a set of restrictions on the potential outcome function $\mathbf{Y} : \mathbb{W} \mapsto \mathbb{Y}^N$.

These restrictions can include the absence of any treatment effects, e.g., $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for
all $\mathbf{w}, \mathbf{w}'$ and all $i$. They can also include more limited restrictions on the potential outcome
function.

Definition 2 (A Sharp Null Hypothesis on Treatment Effects) A null hypothesis on
treatment effects $H_0$ is sharp for $(\mathbb{W}, \mathbb{P})$ if, given the value of $(\mathbf{w}, \mathbf{Y}(\mathbf{w}))$ for a single assignment
$\mathbf{w} \in \mathbb{W}$, under $H_0$ we can infer the value of $\mathbf{Y}(\mathbf{w}')$ for any other $\mathbf{w}' \in \mathbb{W}$.

Now consider a test statistic, $T : \mathbb{Y}^N \times \mathbb{W} \mapsto \mathbb{R}$. For a given experiment $\mathcal{E} = (\mathbb{W}, \mathbb{P}, p(\cdot))$ the
test statistic $T(\mathbf{Y}(\mathbf{W}), \mathbf{W})$ is random only through its dependence on the treatment (directly,
and indirectly through the dependence of the realized outcome on the treatment). We can infer
the distribution of the test statistic for a sharp null hypothesis. The p-value for the statistic 
under the null hypothesis is then the probability that the realization of the test statistic is at 
least as extreme as the observed value: 

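$$\text{p-value} = \mathrm{pr}\left( |T(\mathbf{Y}(\mathbf{W}), \mathbf{W})| \geq |T(\mathbf{Y}(\mathbf{W}^{\mathrm{obs}}), \mathbf{W}^{\mathrm{obs}})| \right).$$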

In most cases we do not have available a closed form expression to calculate this p-value exactly.
However, we can approximate it arbitrarily accurately by taking $B$ independent draws $\mathbf{W}_b$ from
the distribution of the assignment, $p(\cdot)$, and calculating the proportion of these $B$ draws that
would have led to a value for the statistic larger than or equal to the observed value of the
statistic:

$$\widehat{\text{p-value}} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left( |T(\mathbf{Y}(\mathbf{W}_b), \mathbf{W}_b)| \geq |T(\mathbf{Y}(\mathbf{W}^{\mathrm{obs}}), \mathbf{W}^{\mathrm{obs}})| \right),$$

for some large value of $B$. This estimate is unbiased for the true p-value, and its variance is
bounded by $1/(4B)$, which can be made arbitrarily small by choosing $B$ large enough.
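
To make the simulation step concrete, the following sketch estimates the p-value for the sharp null hypothesis of no treatment effects, under which the observed outcomes are the outcomes for every assignment. The completely randomized design, the difference-in-means statistic, the data, and all function names are assumptions chosen for illustration, not the paper's implementation.

```python
import random

def diff_in_means(y, w):
    """Average outcome of treated units minus average outcome of controls."""
    treated = [y_i for y_i, w_i in zip(y, w) if w_i == 1]
    control = [y_i for y_i, w_i in zip(y, w) if w_i == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def randomization_p_value(y_obs, w_obs, draw_assignment, n_draws, rng):
    """Monte Carlo estimate of pr(|T(Y(W), W)| >= |T(Y(W_obs), W_obs)|).

    Under the sharp null of no treatment effects, Y(w) = y_obs for every w,
    so the outcomes stay fixed while the assignment is redrawn.
    """
    t_obs = abs(diff_in_means(y_obs, w_obs))
    hits = 0
    for _ in range(n_draws):
        w_b = draw_assignment(rng)
        if abs(diff_in_means(y_obs, w_b)) >= t_obs:
            hits += 1
    return hits / n_draws

def completely_randomized(n, n_treated):
    """Assumed design: exactly n_treated of n units treated, uniformly at random."""
    def draw(rng):
        w = [1] * n_treated + [0] * (n - n_treated)
        rng.shuffle(w)
        return w
    return draw

# Hypothetical data for 8 units, 4 of them treated.
y_obs = [3.1, 2.4, 5.0, 4.2, 1.9, 2.2, 4.8, 2.0]
w_obs = [1, 0, 1, 1, 0, 0, 1, 0]
B = 2000
p_hat = randomization_p_value(
    y_obs, w_obs, completely_randomized(8, 4), B, random.Random(0)
)
std_err_bound = (1 / (4 * B)) ** 0.5  # square root of the variance bound 1/(4B)
print(p_hat, std_err_bound)
```

With $B = 2000$ draws the bound on the standard error of the estimate is about 0.011, so doubling the precision requires roughly four times as many draws.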

In some cases the statistic does not have a symmetric distribution under the null, and we 
may look at twice the minimum of the tail probabilities, 

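$$\text{p-value} = 2 \times \min\Big\{ \mathrm{pr}\left( T(\mathbf{Y}(\mathbf{W}), \mathbf{W}) \geq T(\mathbf{Y}(\mathbf{W}^{\mathrm{obs}}), \mathbf{W}^{\mathrm{obs}}) \right),$$

$$\mathrm{pr}\left( T(\mathbf{Y}(\mathbf{W}), \mathbf{W}) \leq T(\mathbf{Y}(\mathbf{W}^{\mathrm{obs}}), \mathbf{W}^{\mathrm{obs}}) \right) \Big\}.$$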

Most of the null hypotheses we consider in this paper are not sharp. However, they imply 
that only a limited set of changes in the treatment actually change outcomes. To capture this, 
it is useful to introduce the notion of level sets, that is, sets of assignments with zero treatment
effects.

Definition 3 (Level Sets) Given a null hypothesis $H_0$, for each individual $i$ and for each
treatment level $\mathbf{w}$, define the level set $V(i, \mathbf{w}, H_0)$ as follows:

$$V(i, \mathbf{w}, H_0) = \{\mathbf{w}' \in \mathbb{W} \mid Y_i(\mathbf{w}') = Y_i(\mathbf{w}) \text{ given } H_0\}.$$

Thus, the level set for unit i given treatment vector w is the set of treatments w' such that 
under the null hypothesis, the potential outcome for unit i is identical to the potential outcome 
given treatment w.^5 (More generally we could define this set as the set of treatments where
we can infer the potential outcomes, but outside the case where these potential outcomes are 
equal there are few cases of interest so we do not include that level of generality.) 

These level sets play an important role in our approach, and it is useful to see what form 
they can take. For the sharp null hypothesis that there is no treatment effect whatsoever, 

^5 Manski and Tamer (2002) make use of level sets for non-network data. Related work on networks makes use
of some concepts directly related to level sets. Manski (2013) and Eckles, Karrer, and Ugander (2014) work with 
effective treatments, where each effective treatment corresponds to a level set, one of which is the observed level 
set. Aronow and Samii (2013) and Ugander et al. (2013) work with exposure models, which uniquely specify 
effective treatments. 

$V(i, \mathbf{w}, H_0)$ is equal to $\mathbb{W}$ for all $i$ and all $\mathbf{w}$. With non-sharp null hypotheses, however, the
set $V(i, \mathbf{w}, H_0)$ may vary, both by individual and by treatment. For example, in the setting
where $\mathbb{W} = \{0,1\}^N$, if the null hypothesis allows for a direct effect of an individual’s own
treatment, but not for any effects of other individuals’ treatment status, the set $V(i, \mathbf{w}, H_0)$
equals $\{\mathbf{w}' \in \mathbb{W} \mid w'_i = w_i\}$, so that for each individual there are two possible values for the set,
depending on the individual’s own treatment status. At the other extreme, if the null hypothesis
does not impose any restrictions, then level sets consist of singletons: $V(i, \mathbf{w}, H_0) = \{\mathbf{w}\}$.

Because within a level set the treatment effect is zero, we can in principle do randomization 
inference on treatment effects for that individual. 

It will play an important role later that in general for each unit i these level sets define 
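a partition of the assignment space $\mathbb{W}$ into $J$ level sets $\mathbb{W}_1, \ldots, \mathbb{W}_J$ such that for all $\mathbf{w} \in \mathbb{W}$,
$V(i, \mathbf{w}, H_0) \in \{\mathbb{W}_1, \ldots, \mathbb{W}_J\}$. If there are no restrictions at all, the elements of this partition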
consist of singletons, but in many interesting cases the number of elements of this partition is 
small. For example, for the null hypothesis that there are no spillovers, the partition contains 
two sets. 
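
For a small population the partition into level sets can be enumerated directly. The sketch below is illustrative only: it assumes binary treatments and represents a null hypothesis by a hypothetical `key` function that returns the part of the assignment that may still affect the unit's outcome, so that two assignments share a level set exactly when their keys agree.

```python
from collections import defaultdict
from itertools import product

def level_set_partition(n, key):
    """Group all assignments in {0,1}^n by key(w); each group is one level set."""
    partition = defaultdict(list)
    for w in product((0, 1), repeat=n):
        partition[key(w)].append(w)
    return dict(partition)

i = 0  # the unit whose level sets are examined

# No spillovers: only the unit's own treatment may matter.
no_spillovers = level_set_partition(3, key=lambda w: w[i])
# No treatment effects at all (sharp null): nothing matters.
no_effects = level_set_partition(3, key=lambda w: ())
# No restrictions: the full assignment may matter.
no_restrictions = level_set_partition(3, key=lambda w: w)

print(len(no_spillovers), len(no_effects), len(no_restrictions))
```

With three units the three cases give 2, 1, and 8 level sets respectively, matching the two-set partition for the no-spillovers null and the singleton partition when nothing is restricted.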

3.2 Null Hypotheses on Spillovers 

We are interested in testing for the effect of exposure to the treatment for some individuals 
on the outcomes for others. We refer to such effects as “spillovers,” “interactions” or “peer 
effects.” In the case where they are limited to the effects of direct neighbors, the peer effects 
we study are what Manski (1993) calls “exogenous peer effects.” 

First we consider the following three specific hypotheses that allow for a range of spillovers. 
Recall that in general the hypotheses we consider are restrictions on the mapping $\mathbf{Y} : \mathbb{W} \mapsto \mathbb{Y}^N$.

Hypothesis 1 (No Treatment Effects) $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and all pairs of assignments
$\mathbf{w}, \mathbf{w}' \in \mathbb{W}$.

This is a sharp null hypothesis in the original experiment, because for all $\mathbf{w}' \in \mathbb{W}$ the potential
outcomes $Y_i(\mathbf{w}')$ can be inferred from the observed treatment and observed outcomes
$(\mathbf{w}, \mathbf{Y}(\mathbf{w}))$ under the null hypothesis. Thus, the calculation of Fisher exact p-values is conceptually
straightforward.

Next, we consider a weaker null hypothesis that allows for effects of the own treatment on 
the own outcome, but not of the own treatment on a neighbor’s outcome: 

Hypothesis 2 (No Spillovers) $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and all pairs of assignment vectors
$\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_i = w'_i$.

This null hypothesis is the one considered by Aronow (2012). It is not sharp, because it does
not rule out that exposure to the treatment affects the outcome for the unit exposed. Manski
(2013) refers to settings where this hypothesis holds as settings with “individualistic treatment
response.” This null hypothesis is implied by the stable-unit-treatment-value-assumption
(SUTVA, Rubin, 1980). Under this assumption we can simplify the notation to the conventional
one in the causal effect literature where the potential outcomes are a function of the own
treatment only, $Y_i(\mathbf{w}) = Y_i(w_i)$. Because we consider more general cases, we continue to write
the potential outcomes as a function of the full $N$-component vector $\mathbf{w}$.

We can go beyond hypotheses ruling out all spillover effects, and allow for first order, but 
not higher order, spillover effects. That is, changing the treatment for neighbors may affect 
one’s outcome, but changing the treatment for neighbors-of-neighbors does not change one’s
outcome.

Hypothesis 3 (No Second and Higher-Order Spillovers) $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$,
and for all pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_j = w'_j$ for all units $j$ such that
$d(i,j) < 2$.

Consider the following example where testing for higher order spillovers may be interesting. 
Suppose one can observe one’s own treatment as well as the treatment of one’s network neighbors,
for example because of face-to-face interactions. One can also observe one’s own outcome,
but not the outcome for neighbors. It may well be that in such cases there are spillover effects 
from neighbors, but no spillover effects from neighbors-of-neighbors or individuals even more 
distant in the network. Testing for higher order spillover effects could then be interpreted as 
testing whether the network captures all the connections. 

Some theoretical models (e.g., Toulis and Kao, 2013) model spillover effects in a way that
rules out higher order spillover effects. At the same time some researchers claim to find higher
order spillover effects in empirical work (e.g., Bond et al., 2012). Our tests are the first exact
tests available for such hypotheses.

We can embed these three hypotheses in a more general one that restricts k-th order spillover 
effects for arbitrary k. 

Hypothesis 4 (No $k$-th and Higher Order Spillovers) For unit $i$, for $i = 1, \ldots, N$,
$Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_j = w'_j$ for all units $j$
such that $d(i,j) < k$.

(Here we interpret the set of pairs $\mathbf{w}$ and $\mathbf{w}'$ such that $w_j = w'_j$ for $j \in \emptyset$ as the set of all $\mathbf{w}$ and
$\mathbf{w}'$.) The assumption of no effects (Hypothesis 4 with $k = 0$) is equivalent to Hypothesis 1,
the assumption of no first and higher order peer effects (Hypothesis 4 with $k = 1$) is equivalent
to Hypothesis 2, and the hypothesis of no second and higher order peer effects (Hypothesis 4
with $k = 2$) is equivalent to Hypothesis 3.
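
The nesting of these hypotheses can be seen by listing, for a given unit $i$ and order $k$, the units $j$ with $d(i,j) < k$ whose treatments are still allowed to matter. The sketch below assumes a precomputed distance matrix for a hypothetical four-unit path network; the function names are illustrative.

```python
def relevant_units(i, dist, k):
    """Units j with d(i, j) < k: under the k-th order null only their
    treatments may affect the outcome of unit i."""
    return [j for j in range(len(dist)) if dist[i][j] < k]

def same_level_set(i, w, w_prime, dist, k):
    """True if the null hypothesis forces Y_i(w) = Y_i(w_prime)."""
    return all(w[j] == w_prime[j] for j in relevant_units(i, dist, k))

# Distances on the hypothetical path network 0 - 1 - 2 - 3.
dist = [[0, 1, 2, 3],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [3, 2, 1, 0]]

for k in (0, 1, 2):
    print(k, relevant_units(0, dist, k))
```

For unit 0 the relevant set is empty when $k = 0$ (no effects at all), contains only the unit itself when $k = 1$ (no spillovers), and adds the immediate neighbor when $k = 2$.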

We can also test the hypothesis that there are no direct effects of the own treatment, while 
allowing for indirect effects from neighbors. 

Hypothesis 5 (No Direct Effects) $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and for all pairs of assignment
vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_j = w'_j$ for all units $j$ such that $d(i,j) = 1$.

The most interesting version of this null hypothesis might be to test whether the direct effect of
the treatment is zero for individuals whose neighbors are all in the control group. This would 
imply that there could only be a direct effect of the treatment for individuals with at least 
some treated neighbors. This may be natural in cases where the treatment is some service that 
requires interacting with other individuals who have the service. 


3.3 Null Hypotheses on Sparsification and Competing Networks 

In the second class of null hypotheses we start with two networks, corresponding to adjacency
matrices $G_1$ and $G_2$. In some cases of interest these may be nested networks, with $G_{1,ij} \leq G_{2,ij}$
so that $G_1$ is a sparsified version of $G_2$. Suppose we ask individuals whom they regularly
interact with, as well as whom they have ever interacted with. The first network would define
edges using the first question, and the second network would use the second question. For 
example, researchers have used data on emails between employees at Enron to define a network 
in terms of a threshold for email volume (Goldenberg, Zheng, Fienberg and Airoldi, 2009). 

Alternatively the two networks could correspond to distinct measures of interactions without
necessarily being nested, so that for some pairs $(i,j)$, we have $G_{1,ij} > G_{2,ij}$ whereas for other
pairs $(i',j')$ we have $G_{1,i'j'} < G_{2,i'j'}$. For example, one network definition may be based on
email interactions, where another network definition is based on instant messaging interactions,
or face-to-face interactions.

We consider the null hypothesis that there is no effect on unit i of the exposure of unit j if 
$i$ and $j$ are neighbors in the second network $G_2$, while allowing for effects on the outcome for
unit $i$ of exposure for units $j$ to whom unit $i$ is a neighbor in the first network $G_1$.

Hypothesis 6 $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and for all pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$
such that $w_j = w'_j$ for all units $j$ such that $G_{1,ij} = 1$.

3.4 Null Hypotheses on Peer Effect Heterogeneity 

Many models of peer effects assume not only that only direct neighbors can influence an individual’s
outcomes, but also that for any individual it is only the number of treated neighbors that
matters, not which of their neighbors got treated. In other words, if we take an individual $i$ with
two neighbors, $j$ and $j'$, the outcome for individual $i$ given assignment $\mathbf{w}$ with $(w_j = 0, w_{j'} = 1)$
is the same as the outcome given assignment $\mathbf{w}'$ with $(w'_j = 1, w'_{j'} = 0)$. Such hypotheses are
maintained in many structural models of peer effects, for example the linear-in-means models 
considered in Manski (1993, 2013). 


Hypothesis 7 (No Peer Effect Heterogeneity) $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and for all
pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $\sum_{j=1}^{N} w_j G_{ij} = \sum_{j=1}^{N} w'_j G_{ij}$.

An interesting alternative hypothesis could be that in terms of their effect on outcomes for 
individual i, high-degree neighbors of i are more or less influential than low-degree neighbors 
of i. This hypothesis implies no second and higher order peer effects, but it is stronger than 
that. It restricts the range of first order peer effects that is allowed. 

A related hypothesis implies that all that matters is that at least one neighbor is exposed to 
the treatment, and that treating additional neighbors does not affect an individual’s outcome. 

Hypothesis 8 (Threshold Peer Effects) $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and for all pairs of
assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that
$\mathbf{1}\{\sum_{j=1}^N w_j \cdot G_{ij} > 0\} = \mathbf{1}\{\sum_{j=1}^N w'_j \cdot G_{ij} > 0\}$.

Here an interesting alternative hypothesis could be that the number of treated neighbors matters.
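
To make the two exposure restrictions concrete, the following minimal Python sketch compares two assignment vectors from the point of view of a single unit. The function names and the list-of-lists adjacency matrix `G` are illustrative assumptions, not notation from the paper; the checks look only at neighbor exposure, mirroring the conditions in Hypotheses 7 and 8 as stated above.

```python
def treated_neighbor_count(w, G, i):
    """Number of treated neighbors of unit i, i.e. sum_j w[j] * G[i][j]."""
    return sum(w[j] * G[i][j] for j in range(len(w)))


def same_exposure_count(w, w_prime, G, i):
    """Hypothesis 7 condition: equal numbers of treated neighbors."""
    return treated_neighbor_count(w, G, i) == treated_neighbor_count(w_prime, G, i)


def same_exposure_threshold(w, w_prime, G, i):
    """Hypothesis 8 condition: same 'at least one treated neighbor' indicator."""
    return (treated_neighbor_count(w, G, i) > 0) == (treated_neighbor_count(w_prime, G, i) > 0)


# Hypothetical star network: unit 0 is linked to units 1, 2 and 3.
G = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
one_treated = (0, 1, 0, 0)
other_treated = (0, 0, 1, 0)
two_treated = (0, 1, 1, 0)

swap_ok_h7 = same_exposure_count(one_treated, other_treated, G, 0)
more_ok_h7 = same_exposure_count(one_treated, two_treated, G, 0)
more_ok_h8 = same_exposure_threshold(one_treated, two_treated, G, 0)
```

In this example, swapping which neighbor is treated leaves unit 0's exposure unchanged under both hypotheses, while treating a second neighbor changes it under Hypothesis 7 but not under Hypothesis 8.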


Hypotheses: Artificial Experiments 

This section contains the main conceptual contribution of the paper. We describe at an abstract 
level our approach to the problem of non-sharp null hypotheses. This solution is based on 
analyzing an artificial experiment that differs from the experiment actually conducted. The 
artificial experiment is chosen to satisfy two conditions. First, it is chosen so that the original 
null hypothesis, which was not sharp for the original experiment, is sharp for the artificial 
experiment, and second, it is chosen so that the randomization-based analysis of the artificial 
experiment is validated by the design of the original experiment. 

We start with an experiment $\mathcal{E}$, consisting of a set of values $\mathbb{W}$ for the assignment
$\mathbf{W}$, a population $\mathbb{P}$ with $N$ units, and an assignment mechanism
$p : \mathbb{W} \mapsto [0,1]$. Although in our applications the set $\mathbb{W}$ has the structure
$\mathbb{W} = \{0,1\}^N$, this need not be the case in general. In addition we have a null hypothesis
$H_0$ that places restrictions on the function $\mathbf{Y} : \mathbb{W} \mapsto \mathbb{Y}^N$.
Instead of testing $H_0$ with the data from this experiment using the randomization distribution
implied by $p(\cdot)$, we will analyze a different, artificial, experiment, for which the
randomization-based analysis is validated by the design of the original experiment. Let the
artificial experiment be denoted by $\mathcal{E}'$. The difference between the artificial experiment
and the original experiment has three components. Only one is a choice of the researcher; the
remaining two follow from the combination of that choice, the original experiment, and the null
hypothesis of interest.

In general, test statistics are functions
$T : \mathbb{Y}^N \times \mathbb{W} \times \mathbb{X}^N \times \mathbb{G} \mapsto \mathbb{R}$, which are
evaluated at $(\mathbf{Y}(\mathbf{W}), \mathbf{W}, \mathbf{X}, \mathbf{G})$. The first step is to restrict
the population whose outcomes the test statistic is allowed to vary with. We denote this
subpopulation by $\mathbb{P}_F$, and refer to the individuals in this subpopulation as the focal
units, with $F_i$ an indicator that is equal to one for focal units and zero otherwise. In the
special case where the null hypothesis is that of no spillovers at all, the focal subpopulation
corresponds to the subpopulation of fixed units in Aronow (2012), who refers to its complement
as the variant units. However, because in our approach the artificial experiment may also need
to hold fixed the treatment assignment for some units outside what Aronow calls the fixed
subpopulation, we use a different terminology. At this point the choice of focal subpopulation
is arbitrary. Its choice does not affect the validity of the resulting p-values, but as we
shall discuss below, it has a major impact on the power of the test. Let $N_F$ be the
cardinality of the set $\mathbb{P}_F$, let $\mathbf{Y}_F(\mathbf{w})$ denote the $N_F$-vector of potential
outcomes for the focal units for any treatment $\mathbf{w}$, and let
$\mathbf{Y}_F^{\mathrm{obs}} = \mathbf{Y}_F(\mathbf{W})$ be the vector of realized outcomes for these units
given the actual assignment $\mathbf{W}$. The selection of this subpopulation can depend generally on
the fixed characteristics of the population $\mathbf{X}$, and the network $\mathbf{G}$. It cannot depend
on the assignment $\mathbf{W}$ either directly, or indirectly through dependence on the realized
outcome $\mathbf{Y}^{\mathrm{obs}}$. We now consider test statistics
$T : \mathbb{Y}^{N_F} \times \mathbb{W} \times \mathbb{X}^N \times \mathbb{G} \mapsto \mathbb{R}$, evaluated
at $(\mathbf{Y}_F(\mathbf{W}), \mathbf{W}, \mathbf{X}, \mathbf{G})$.

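Given the focal subpopulation $\mathbb{P}_F$ and the null hypothesis $H_0$, define the set of subsets
of $\mathbb{W}$

$$\mathbb{S} = \bigcup_{\mathbf{w} \in \mathbb{W}} \Big\{ \bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}, H_0) \Big\}.$$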


This set plays a key role in our approach. An important property is that it is a partition of W. 

Proposition 4.1 (Partition of the Assignment Space) $\mathbb{S}$ is a partition of $\mathbb{W}$.

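Proof: Because $\mathbf{w} \in \bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}, H_0)$, it immediately
follows that $\bigcup_{\mathbb{V} \in \mathbb{S}} \mathbb{V} = \mathbb{W}$. Thus the remaining property to be
established is that either
$(\bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}, H_0)) \cap (\bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}', H_0)) = \emptyset$
or $\bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}, H_0) = \bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}', H_0)$.
If $(\bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}, H_0)) \cap (\bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}', H_0))$
is not equal to the empty set, there must be a $\mathbf{w}''$ such that, for all focal units $i$,
$\mathbf{w}'' \in \mathbb{V}(i, \mathbf{w}, H_0)$ and $\mathbf{w}'' \in \mathbb{V}(i, \mathbf{w}', H_0)$. Then

$$\mathbf{Y}_F(\mathbf{w}'') = \mathbf{Y}_F(\mathbf{w}') = \mathbf{Y}_F(\mathbf{w}). \qquad (4.1)$$

Hence if there is another element $\mathbf{w}''' \in \mathbb{V}(i, \mathbf{w}', H_0)$, it must be the case that

$$\mathbf{Y}_F(\mathbf{w}''') = \mathbf{Y}_F(\mathbf{w}').$$

By (4.1) this is equal to $\mathbf{Y}_F(\mathbf{w}'')$, and also by (4.1) this is equal to
$\mathbf{Y}_F(\mathbf{w})$. Hence it must be the case that

$$\mathbf{Y}_F(\mathbf{w}''') = \mathbf{Y}_F(\mathbf{w}'') = \mathbf{Y}_F(\mathbf{w}') = \mathbf{Y}_F(\mathbf{w}),$$

and $\mathbf{w}''' \in \mathbb{V}(i, \mathbf{w}, H_0)$. Therefore
$\bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}, H_0) = \bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{w}', H_0)$,
which finishes the proof. □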
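
The partition property can be checked by brute force on a tiny example. The sketch below is an illustration under stated assumptions: it takes the assignment space to be all of $\{0,1\}^4$, uses the no-spillover null (under which a unit's level set fixes only its own treatment), and the helper names `strata` and `no_spillover` are hypothetical.

```python
from itertools import product


def strata(n_units, focal, same_outcome):
    """Collect the distinct intersections of level sets over the focal units."""
    space = list(product((0, 1), repeat=n_units))
    cells = set()
    for w in space:
        # Intersection over focal i of V(i, w, H0), enumerated explicitly.
        cell = frozenset(
            w2 for w2 in space
            if all(same_outcome(i, w, w2) for i in focal)
        )
        cells.add(cell)
    return space, cells


def no_spillover(i, w, w2):
    """Under the no-spillover null, unit i's outcome depends only on w[i]."""
    return w[i] == w2[i]


# Four units, with units 0 and 2 chosen as focal (an arbitrary illustrative choice).
space, cells = strata(4, focal=[0, 2], same_outcome=no_spillover)

covers_space = set().union(*cells) == set(space)
cells_disjoint = sum(len(c) for c in cells) == len(space)
```

With two focal units there is one cell per configuration of the focal treatments, and the cells are disjoint and jointly cover the assignment space, as the proposition states.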

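The third component of the artificial experiment consists of a new assignment mechanism
$p' : \mathbb{W} \mapsto [0,1]$. To define this third component we decompose the original experiment
into a stratified experiment. Given the partition $\mathbb{S} = \{\mathbb{W}_1, \ldots, \mathbb{W}_J\}$,
define the stratum indicator $S : \mathbb{W} \mapsto \{1, \ldots, J\}$, so that the stratum is
$S(\mathbf{w}) = j$ if $\mathbf{w} \in \mathbb{W}_j$. Now we can think of the original experiment
$\mathcal{E}$ as a stratified experiment where we first draw the stratum $S$, with
$\mathrm{pr}(S = j) = \mathrm{pr}(\mathbf{W} \in \mathbb{W}_j)$, followed by the second stage where we draw
$\mathbf{W}$ conditional on $S$, with

$$p'(\mathbf{w}) = \mathrm{pr}(\mathbf{W} = \mathbf{w} \mid S = j) = \frac{p(\mathbf{w})}{\sum_{\mathbf{w}' \in \mathbb{W}_j} p(\mathbf{w}')} \quad \text{if } \mathrm{pr}(S = j) > 0,\ \mathbf{w} \in \mathbb{W}_j,$$

and zero otherwise. Below, $\mathbb{W}_S$ denotes the stratum that contains the realized assignment.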

Now we propose to analyze the artificial experiment
$\mathcal{E}' = (\mathbb{W}_S, \mathbb{P}_F, p'(\cdot))$. The set of restrictions on the values of the
function $\mathbf{Y} : \mathbb{W} \mapsto \mathbb{Y}^N$ that corresponds to the original null hypothesis
translates into a set of restrictions on the values of the function
$\mathbf{Y}_F : \mathbb{W}_S \mapsto \mathbb{Y}^{N_F}$, which gives us the implicit null hypothesis for the
new experiment. By construction, the set of assignments $\mathbb{W}_S$ and the focal population
$\mathbb{P}_F$ are chosen so that the null hypothesis is sharp for this artificial experiment.
Formally, for any pair $(\mathbf{w}, \mathbf{Y}_F(\mathbf{w}))$ with $\mathbf{w} \in \mathbb{W}_S$, we can infer
the values of $\mathbf{Y}_F(\mathbf{w}')$ for any other value $\mathbf{w}' \in \mathbb{W}_S$. We discuss some
examples of this in the next section. We then choose a statistic
$T : \mathbb{Y}^{N_F} \times \mathbb{W} \times \mathbb{X}^N \times \mathbb{G} \mapsto \mathbb{R}$ that depends
only on the outcomes for the individuals in the focal population,
$\mathbf{Y}_F^{\mathrm{obs}} = \mathbf{Y}_F(\mathbf{W})$. We calculate p-values for this statistic by comparing
the realized value of the statistic,
$T^{\mathrm{obs}} = T(\mathbf{Y}_F^{\mathrm{obs}}, \mathbf{W}, \mathbf{X}, \mathbf{G})$, to the randomization
distribution for $T(\mathbf{Y}_F(\mathbf{W}), \mathbf{W}, \mathbf{X}, \mathbf{G})$ induced by the modified
assignment distribution $p'(\cdot)$.

A key insight is that a randomization-based analysis of the artificial experiment $\mathcal{E}'$ is
validated by the design of the original experiment $\mathcal{E}$. Let us consider the two
modifications (changing the population and using a conditional assignment mechanism) in turn
and justify this claim. First, choosing a subpopulation of units based on fixed attributes or
pretreatment variables such that the test statistic varies only with outcomes for these units
cannot invalidate the p-value, because the p-value is valid for any statistic. Second, consider
the change in the assignment mechanism. We can think of the original assignment mechanism,
corresponding to the distribution $p(\cdot)$, as a two-stage procedure: first we choose $S$, and
then the actual assignment is determined by drawing according to $p'(\cdot)$, where
$p'(\mathbf{w}) = \mathrm{pr}(\mathbf{W} = \mathbf{w} \mid \mathbf{W} \in \mathbb{W}_S)$. Thus the artificial
experiment conditions on the value of $S$ and only exploits the second stage randomization. In
general this may discard information, but it does not affect the validity.
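
The conditional analysis can be sketched as a small enumeration routine. This is a minimal illustration, not the authors' implementation: it assumes the stratum $\mathbb{W}_S$ is small enough to list, that large values of the statistic count as evidence against the null (a one-sided convention chosen here), and the names `conditional_p_value`, `prob` and `statistic` are hypothetical.

```python
def conditional_p_value(w_obs, y_focal, stratum, prob, statistic):
    """One-sided exact p-value in the artificial experiment.

    stratum   -- list of assignments in W_S (must contain w_obs)
    prob      -- function returning p(w) under the original design
    statistic -- function T(y_focal, w); large values count against the null
    """
    total = sum(prob(w) for w in stratum)
    t_obs = statistic(y_focal, w_obs)
    p_value = 0.0
    for w in stratum:
        # The null is sharp on W_S, so focal outcomes stay at their observed values.
        if statistic(y_focal, w) >= t_obs:
            p_value += prob(w) / total
    return p_value


# Hypothetical example: units 0 and 1 are focal, units 2 and 3 are auxiliary,
# and each unit is treated independently with probability 1/2.
y_focal = [3.0, 1.0]
w_obs = (1, 0, 1, 0)
stratum = [(1, 0, a, b) for a in (0, 1) for b in (0, 1)]


def uniform(w):
    return 0.5 ** len(w)


def stat(y, w):
    # Illustrative statistic: focal outcome weighted by the paired auxiliary treatment.
    return y[0] * w[2] + y[1] * w[3]


p = conditional_p_value(w_obs, y_focal, stratum, uniform, stat)
```

Only the auxiliary treatments vary across the stratum, so the focal outcomes can be reused unchanged for every candidate assignment; this is what makes the null sharp in the artificial experiment.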

Here we discuss how exact p-values can be calculated for the hypotheses introduced in Section
3, given randomized assignment of the treatments. To simplify the discussion we focus in this
section initially on a completely randomized experiment, where M units out of N are randomly
selected to receive the treatment (see Imbens and Rubin, 2015, for a general discussion). In
Section 5.5 we discuss extensions to clustered randomized experiments.

Assumption 5.1 (Random Assignment)

$$\mathrm{pr}(\mathbf{W} = \mathbf{w}) = 1 \Big/ \binom{N}{M},$$

for all $\mathbf{w} \in \{0,1\}^N$ such that $\sum_{i=1}^N w_i = M$.

To set the stage, let us first consider the case where we test the null hypothesis of no
treatment effects whatsoever. In that case $\mathbb{V}(i, \mathbf{w}, H_0) = \mathbb{W}$ for each
individual $i$, we can take the subpopulation of focal units to be the entire population,
$\mathbb{P}_F = \mathbb{P}$, and the partition is $\mathbb{S} = \{\mathbb{W}\}$. Then the assignment
mechanism is the same under the artificial experiment as it is under the original experiment,
$p'(\cdot) = p(\cdot)$, and thus the artificial experiment is identical to the original experiment.

5.1 Exact P-values for the Null Hypothesis of No Spillovers when the Network consists of Dyads

To develop some intuition for the problem we first look at the case where the network has
a simple structure. Suppose the population consists of $N$ units paired into $N/2$ dyads. For
individual $i$ let $\ell(i) \in \{1, \ldots, N\}$ be the index of the neighbor of individual $i$. We
are interested in testing the hypothesis that there are no spillover effects (Hypothesis 2),
allowing for the possibility of direct effects of the own treatment on an individual’s own outcome.

5.1.1 The Artificial Experiment 

To create the artificial experiment $\mathcal{E}'$ we first select the focal subpopulation. We do
this by selecting one member from each pair, and designating that individual in the pair as the
focal individual. This selection can be random, or based on pretreatment variables, but not on
outcome or assignment data. Let $F_i = 1$ if an individual is a focal individual and $F_i = 0$ for
non-focal, or auxiliary, individuals. Selecting one focal unit from each pair is not required
for our approach, but it makes intuitive sense. If both members of a pair are focal units, then
the level sets imply that we cannot vary the treatments for any member of the pair in the
artificial experiment. If neither member of the pair is focal we do not use the outcomes for the
two units. In both cases the pair is essentially dropped from the analysis, so only if there is
a single focal unit in a pair does that pair enter the analysis.

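In the second step, we define the restricted set of assignments $\mathbb{W}_S$. Let $\mathbf{W}$ be the
full assignment vector. For individual $i$,
$\mathbb{V}(i, \mathbf{W}, H_0) = \{\mathbf{w}' \in \mathbb{W} \mid w'_i = W_i\}$. Hence

$$\mathbb{W}_S = \bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{W}, H_0) = \{\mathbf{w} \in \mathbb{W} \mid w_i = W_i \text{ for all } i \in \mathbb{P}_F\},$$

allowing only the treatments for the non-focal, or auxiliary, units to vary. Let
$M_F = \sum_{i: F_i = 1} W_i$ be the number of treated focal individuals, and $M - M_F$ the number
of treated auxiliary individuals. Then, because there are $N/2$ auxiliary individuals, the
distribution of assignments $p'(\cdot)$ in the artificial experiment satisfies

$$p'(\mathbf{w}) = \mathrm{pr}(\mathbf{W} = \mathbf{w} \mid S) = 1 \Big/ \binom{N/2}{M - M_F},$$

for $\mathbf{w} \in \mathbb{W}_S$ (with $M$ treated units in total), and zero otherwise.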
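
As a small check of this count, the sketch below enumerates the restricted assignment set for a toy dyad design under complete randomization. The function name `dyad_stratum` and the example data are hypothetical; the code simply holds the focal treatments fixed and redistributes the observed number of treated auxiliary units.

```python
from itertools import combinations
from math import comb


def dyad_stratum(w_obs, focal):
    """List every assignment that keeps focal treatments and the total M fixed."""
    n = len(w_obs)
    aux = [i for i in range(n) if i not in focal]
    m_aux = sum(w_obs[i] for i in aux)  # M - M_F treated auxiliary units
    assignments = []
    for treated in combinations(aux, m_aux):
        w = list(w_obs)
        for i in aux:
            w[i] = 1 if i in treated else 0
        assignments.append(tuple(w))
    return assignments


# Three dyads (0,1), (2,3), (4,5); the first member of each pair is focal.
w_obs = (1, 0, 0, 1, 1, 0)
focal = {0, 2, 4}
stratum = dyad_stratum(w_obs, focal)

n_aux = len(w_obs) // 2
m_f = sum(w_obs[i] for i in focal)
expected_size = comb(n_aux, sum(w_obs) - m_f)
```

Each listed assignment has conditional probability one over the size of the list, which matches the binomial coefficient in the expression for $p'(\cdot)$.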

Given the experiment $\mathcal{E}'$ we consider test statistics
$T : \mathbb{Y}^{N_F} \times \mathbb{W} \times \mathbb{X} \times \mathbb{G} \mapsto \mathbb{R}$. For any
statistic in this class we can infer its distribution under the null hypothesis. We would like
to choose a statistic whose distribution is sensitive to interesting departures from the null
hypothesis. We consider two statistics, motivated by parametric models that allow for spillover
effects.

5.1.2 Test Statistics 

Consider a model for the potential outcomes that does not impose the null hypothesis of no
spillovers. In that case, with a single neighbor for each individual, the potential outcome for
individual $i$ can be written as a function of the own treatment $W_i$ and the neighbor’s
treatment $W_{\ell(i)}$, or $Y_i(\mathbf{w}) = Y_i(w_i, w_{\ell(i)})$. A natural starting point is to
assume that both direct (own) and indirect (neighbor’s) treatment effects are constant and
additive:

$$Y_i(w_i, w_{\ell(i)}) = \alpha + \tau_{\mathrm{direct}} \cdot w_i + \tau_{\mathrm{spill}} \cdot w_{\ell(i)} + \varepsilon_i. \qquad (5.2)$$

Given this parametric model the null hypothesis of no spillovers corresponds to
$\tau_{\mathrm{spill}} = 0$. To find a statistic with good power properties for testing our
nonparametric null hypothesis of no spillovers, we can look at the Lagrange multiplier or score
test statistic for the null hypothesis $\tau_{\mathrm{spill}} = 0$ in this parametric model, assuming
homoskedasticity, normality and independence for the $\varepsilon_i$. The validity of our proposed
testing procedure does not rely on these parametric and distributional assumptions, but if they
hold, the fact that in that case the test corresponds to a Lagrange multiplier test would endow
the procedure with large sample efficiency.


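In this parametric model the likelihood function for the focal units is

$$L(\alpha, \sigma^2, \tau_{\mathrm{direct}}, \tau_{\mathrm{spill}}) = \prod_{i: F_i = 1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left(Y_i^{\mathrm{obs}} - \alpha - \tau_{\mathrm{direct}} \cdot W_i - \tau_{\mathrm{spill}} \cdot W_{\ell(i)}\right)^2}{2\sigma^2} \right),$$

where $\sigma^2$ is the variance of $\varepsilon_i$. The sum of the scores, that is, the sum over the
focal units of the derivative of the logarithm of the density under this model with respect to
$\tau_{\mathrm{spill}}$, evaluated at $\tau_{\mathrm{spill}} = 0$, is equal to

$$\frac{1}{\sigma^2} \sum_{i: F_i = 1} W_{\ell(i)} \cdot \left( Y_i^{\mathrm{obs}} - \alpha - \tau_{\mathrm{direct}} \cdot W_i \right).$$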

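The statistic we focus on is this sum with $\alpha$ and $\tau_{\mathrm{direct}}$ replaced by estimates
based on the outcomes for only the focal units. These estimates are

$$\hat\alpha = \bar{Y}^{\mathrm{obs}}_{F,0}, \qquad \hat\tau_{\mathrm{direct}} = \bar{Y}^{\mathrm{obs}}_{F,1} - \bar{Y}^{\mathrm{obs}}_{F,0},$$

where, for $w = 0, 1$, $\bar{Y}^{\mathrm{obs}}_{F,w}$ is the average outcome for focal units with
$W_i = w$ and $N_{F,w}$ is the number of focal units with $W_i = w$. This leads to the statistic,
after normalizing by the number of focal units,

$$T_{\mathrm{score}}^{\mathrm{dyad}} = \frac{1}{N_F} \sum_{i: F_i = 1} W_{\ell(i)} \cdot \left( Y_i^{\mathrm{obs}} - \bar{Y}^{\mathrm{obs}}_{F,0} - \left( \bar{Y}^{\mathrm{obs}}_{F,1} - \bar{Y}^{\mathrm{obs}}_{F,0} \right) \cdot W_i \right). \qquad (5.3)$$

This statistic is interpreted as the correlation between the neighbors’ treatment status and
the focal unit’s outcome, adjusted for the average value of the outcome for focal units with the
same treatment status.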
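
The dyad score statistic can be computed in a few lines. The sketch below is illustrative only: the function name, the `partner` list encoding $\ell(i)$, and the toy data are assumptions, and the code requires at least one treated and one control focal unit so that both group means exist.

```python
def dyad_score_statistic(y, w, focal, partner):
    """Mean over focal units of W_partner times the null-model residual.

    y, w    -- outcomes and treatments indexed by unit
    focal   -- list of focal unit indices
    partner -- partner[i] is the index of unit i's dyad partner
    """
    y_control = [y[i] for i in focal if w[i] == 0]
    y_treated = [y[i] for i in focal if w[i] == 1]
    alpha_hat = sum(y_control) / len(y_control)
    tau_hat = sum(y_treated) / len(y_treated) - alpha_hat
    total = sum(
        w[partner[i]] * (y[i] - alpha_hat - tau_hat * w[i])
        for i in focal
    )
    return total / len(focal)


# Four dyads (0,1), (2,3), (4,5), (6,7); even-numbered units are focal.
partner = [1, 0, 3, 2, 5, 4, 7, 6]
focal = [0, 2, 4, 6]
w = [1, 1, 1, 0, 0, 1, 0, 0]
y = [5.0, 0.0, 3.0, 0.0, 2.0, 0.0, 0.0, 0.0]  # auxiliary outcomes are unused

t_dyad = dyad_score_statistic(y, w, focal, partner)
```

In this toy data the focal residuals are $1, -1, 1, -1$ and the partners of the first and third focal units are treated, so positive residuals line up with treated partners.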

Although such a model appears substantively less plausible, it is also interesting to consider
the model in (5.2) without a direct effect:

$$Y_i(w_i, w_{\ell(i)}) = \alpha + \tau_{\mathrm{spill}} \cdot w_{\ell(i)} + \varepsilon_i.$$


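Then the Lagrange multiplier approach leads to the statistic

$$T_{\mathrm{elc}}^{\mathrm{dyad}} = \frac{1}{N_F} \sum_{i: F_i = 1} W_{\ell(i)} \cdot \left( Y_i^{\mathrm{obs}} - \bar{Y}_F^{\mathrm{obs}} \right) = \frac{N_F(1) \cdot N_F(0)}{N_F^2} \cdot \left( \bar{Y}_F^{\mathrm{obs}}(1) - \bar{Y}_F^{\mathrm{obs}}(0) \right), \qquad (5.5)$$

where $\bar{Y}_F^{\mathrm{obs}}$ is the average outcome over all focal units and, for $w = 0, 1$,
$\bar{Y}_F^{\mathrm{obs}}(w)$ is the average outcome for focal units with neighbors whose treatment
status is $W_{\ell(i)} = w$ and $N_F(w)$ is the number of focal individuals whose neighbor has
treatment status $w$. Hence the statistic essentially compares average outcomes for focal units
with treated neighbors and focal units with control neighbors. We refer to this statistic as an
edge-level-contrast statistic for reasons that will become clear below when we generalize the
network setting.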
The first statistic, $T_{\mathrm{score}}$, yields a more powerful test when there are direct effects
of the treatment, because it adjusts for the estimated direct effects of treatment. Failing to do
so introduces additional noise in the distribution of the test statistic.


General Networks

In this section we consider the more general problem of testing for spillover effects in an
unrestricted network setting. We maintain the assumption that the randomization is at the unit
level, with $M$ randomly selected units out of the population of $N$ units exposed to the
intervention. As before we choose a subpopulation of focal individuals whose outcomes we use,
with the complement of this subpopulation the set of auxiliary individuals. This selection may
be random or depend on pretreatment variables. The restricted set of assignments fixes the
assignments for the focal individuals:
$\mathbb{W}_S = \{\mathbf{w} \in \mathbb{W} \mid \mathbf{w}_F = \mathbf{W}_F^{\mathrm{obs}}\}$, allowing only the
treatments for the non-focal or auxiliary units to vary. There are two substantive differences
with the setting where the network consists of pairs. The choice of the statistic is more
complicated, and so is the choice of the focal subpopulation.

5.3 Test Statistics 

We consider three test statistics. The first is a modification of a test statistic previously proposed 
by Bond et al. (2012); the second is optimal for a particular data-generating process; and the 
third is a modification of a statistic proposed by Aronow (2012). 

5.3.1 The Edge-Level Contrast Statistic 

The first statistic we consider is a modification of an edge-level statistic used by Bond et al. 
(2012). Bond et al. test for the presence of spillovers using the randomization distribution 
based on the null hypothesis of no effects of the treatment whatsoever. The statistic they use 
is equal to the difference between the average, over all edges where the alter is exposed to the 
treatment, of the ego’s outcome and the average, over all edges where the alter is not exposed 
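to the treatment, of the ego’s outcome:

$$T_B(\mathbf{W}, \mathbf{Y}^{\mathrm{obs}}, \mathbf{G}) = \frac{\sum_{i,j} G_{ij} \cdot W_j \cdot Y_i^{\mathrm{obs}}}{\sum_{i,j} G_{ij} \cdot W_j} - \frac{\sum_{i,j} G_{ij} \cdot (1 - W_j) \cdot Y_i^{\mathrm{obs}}}{\sum_{i,j} G_{ij} \cdot (1 - W_j)}.$$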

We cannot infer the randomization distribution of this statistic if we only impose the null
hypothesis of no spillovers but allow for direct effects of the treatment (which is the null
hypothesis of interest). Bond et al. report p-values based on the additional assumption that
there are no own effects of the treatment. Without this additional assumption the p-values
reported based on this statistic are therefore not generally valid. In Appendix A we provide
analytical calculations that show that the size distortions for this statistic can be
substantial in the presence of direct effects of the treatment, as high as 0.2 for a nominal
0.05 level test in simple cases.

However, we can modify the Bond et al. statistic, averaging only over the subset of edges
where the ego is in the focal subpopulation and the alter is in the auxiliary subpopulation
(in the current setting where we test the null of no spillovers this subpopulation is equal to
the complement of the focal subpopulation):

$$T_{\mathrm{elc}} = \frac{\sum_{i,j} F_i \cdot G_{ij} \cdot (1 - F_j) \cdot W_j \cdot Y_i^{\mathrm{obs}}}{\sum_{i,j} F_i \cdot G_{ij} \cdot (1 - F_j) \cdot W_j} - \frac{\sum_{i,j} F_i \cdot G_{ij} \cdot (1 - F_j) \cdot (1 - W_j) \cdot Y_i^{\mathrm{obs}}}{\sum_{i,j} F_i \cdot G_{ij} \cdot (1 - F_j) \cdot (1 - W_j)}.$$

We refer to this as the edge-level-contrast statistic. In the case where the network consists of
dyads, it reduces to our second test statistic for the case of dyads, in (5.5).
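
A direct transcription of the edge-level-contrast statistic is given below as a minimal sketch. The function name and the list-based inputs are assumptions made for illustration; the code raises a `ZeroDivisionError` if there are no focal-auxiliary edges with a treated alter, or none with a control alter, which a practical implementation would need to handle.

```python
def edge_level_contrast(y, w, focal, G):
    """Difference in ego outcomes across focal-auxiliary edges by alter treatment.

    y, w  -- outcomes and treatments indexed by unit
    focal -- list of 0/1 focal indicators F_i
    G     -- symmetric 0/1 adjacency matrix as a list of lists
    """
    n = len(y)
    num_treated = den_treated = 0.0
    num_control = den_control = 0.0
    for i in range(n):
        for j in range(n):
            # Keep only edges from a focal ego i to an auxiliary alter j.
            edge = focal[i] * G[i][j] * (1 - focal[j])
            num_treated += edge * w[j] * y[i]
            den_treated += edge * w[j]
            num_control += edge * (1 - w[j]) * y[i]
            den_control += edge * (1 - w[j])
    return num_treated / den_treated - num_control / den_control


# Toy network with edges 0-1, 0-3 and 2-3; units 0 and 2 are focal.
G = [
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
]
focal = [1, 0, 1, 0]
w = [0, 1, 0, 0]
y = [4.0, 0.0, 2.0, 0.0]

t_elc = edge_level_contrast(y, w, focal, G)
```

Here the only edge with a treated alter has ego outcome 4, while the two edges with a control alter have ego outcomes 4 and 2, so focal unit 0 contributes to both averages because it appears on two edges.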

5.3.2 A Score Test Statistic 

We motivate the second test statistic in a more systematic way with a structural model for
treatment effects. Suppose we use a simple linear model, a simplified version of the
linear-in-means model of the type discussed in Manski (1993, 2013) with only exogenous peer
effects:

$$Y_i^{\mathrm{obs}} = \alpha_0 + \tau_{\mathrm{direct}} \cdot W_i + \tau_{\mathrm{exo}} \cdot \sum_{j=1}^N W_j \cdot \bar{G}_{ij} + \varepsilon_i, \qquad (5.7)$$

where $\bar{G}_{ij} = G_{ij} / \sum_{k=1}^N G_{ik}$ is a normalized indicator for links. (If
$\sum_{k=1}^N G_{ik} = 0$ then $\bar{G}_{ij} = 0$.) Hence $\sum_{j=1}^N W_j \cdot \bar{G}_{ij}$ is the
fraction of treated friends.

Testing for spillovers in the context of this model corresponds to testing the parametric null
hypothesis that the exogenous peer effects parameter $\tau_{\mathrm{exo}}$ is equal to zero. A natural
way to derive a powerful test statistic for $\tau_{\mathrm{exo}} = 0$ in a parametric model, and the
basis of Lagrange multiplier tests, is to derive the average score for $\tau_{\mathrm{exo}}$, evaluated
at $\tau_{\mathrm{exo}} = 0$ and at estimates for the nuisance parameters ($\alpha_0$ and
$\tau_{\mathrm{direct}}$ in this case). Under the model in (5.7) the score statistic is proportional to
the covariance, over focal units, between the residual under the null and the fraction of
neighbors who are treated, $\sum_{j} \bar{G}_{ij} \cdot W_j$, leading to

$$T_{\mathrm{score}} = \mathrm{Cov}\left( Y_i^{\mathrm{obs}} - \hat\alpha_0 - \hat\tau_{\mathrm{direct}} \cdot W_i,\ \sum_{j=1}^N W_j \cdot \bar{G}_{ij} \ \Big|\ F_i = 1 \right).$$

Remark 1 If the network consists of dyads, with one unit in each dyad designated focal and
the other auxiliary, then this statistic is identical to the statistic $T_{\mathrm{score}}$ in (5.3).
As in the case of dyads, this test statistic has reduced variance because it adjusts outcomes for
the estimated direct effect of the treatment, at least when direct effects of the treatment are
present. □
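
The score statistic for a general network can be sketched as follows. This is an illustration under stated assumptions: the covariance is taken over focal units with a $1/N_F$ normalization (the normalization does not affect the p-value), the nuisance parameters are estimated from focal-unit group means as in the dyad case, and the function name and toy data are hypothetical.

```python
def score_statistic(y, w, focal, G):
    """Covariance, over focal units, of null residuals and fraction of treated neighbors."""
    n = len(y)
    focal_units = [i for i in range(n) if focal[i] == 1]
    y_control = [y[i] for i in focal_units if w[i] == 0]
    y_treated = [y[i] for i in focal_units if w[i] == 1]
    alpha_hat = sum(y_control) / len(y_control)
    tau_hat = sum(y_treated) / len(y_treated) - alpha_hat

    resid, frac = [], []
    for i in focal_units:
        degree = sum(G[i])
        treated = sum(G[i][j] * w[j] for j in range(n))
        resid.append(y[i] - alpha_hat - tau_hat * w[i])
        # Units without neighbors get a treated fraction of zero.
        frac.append(treated / degree if degree > 0 else 0.0)

    m = len(focal_units)
    mean_r = sum(resid) / m
    mean_f = sum(frac) / m
    return sum((r - mean_r) * (f - mean_f) for r, f in zip(resid, frac)) / m


# Dyad network: pairs (0,1), (2,3), (4,5), (6,7); even-numbered units are focal.
n = 8
G = [[0] * n for _ in range(n)]
for a in range(0, n, 2):
    G[a][a + 1] = G[a + 1][a] = 1
focal = [1, 0, 1, 0, 1, 0, 1, 0]
w = [1, 1, 1, 0, 0, 1, 0, 0]
y = [5.0, 0.0, 3.0, 0.0, 2.0, 0.0, 0.0, 0.0]

t_score = score_statistic(y, w, focal, G)
```

Because the residuals from group means average to zero among focal units, the covariance in a dyad network equals the average product of the residual and the partner's treatment, in line with Remark 1.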

Remark 2 Note that our approach to deriving the test statistic can be applied to alternative 
structural models with different functional forms for outcomes, the nature of spillovers, etc., 
and as above, the test statistic is valid irrespective of the validity of the structural model. The 
power of the test, however, will depend on the quality of the model. □ 

Remark 3 It is also interesting to note that the same score statistic applies to a different model.
Suppose we start with a different version of the linear-in-means model of the type discussed in
Manski (1993, 2013):

$$Y_i^{\mathrm{obs}} = \alpha_0 + \tau_{\mathrm{direct}} \cdot W_i + \tau_{\mathrm{endog}} \cdot \bar{Y}_{(i)} + \varepsilon_i, \qquad (5.9)$$

where $\bar{Y}_{(i)}$ is the average outcome for $i$’s neighbors. In this model the spillovers arise
from the direct effect of one’s own treatment on one’s own outcome (if
$\tau_{\mathrm{direct}} \neq 0$), combined with what Manski calls endogenous effects of the neighbors’
outcomes on one’s own outcome ($\tau_{\mathrm{endog}}$). This implies that treatment exposure for
non-neighbors can affect one’s outcome if the non-neighbors are connected through other
individuals, with the magnitude of the spillover effects depending on the distance between the
individuals in the network. Although this endogenous peer effects model implies that spillover
effects propagate throughout one’s network, the score statistic for this model is identical to
the score statistic $T_{\mathrm{score}}$ derived above for the exogenous-effects model, because close
to the null of no spillover effects the effects are dominated by those of direct neighbors.
Details for this calculation are presented in Appendix B. □

5.3.3 The Has-Treated-Neighbor Test Statistic 

As the third test statistic, we consider a variation on a statistic based on distance to the nearest
treated unit. Aronow (2012) proposes a test statistic for spatial or network interference that is
the correlation between the outcome for focal units and the distance to the nearest treated
auxiliary unit. If distance is defined in terms of hops between two units in a network and there
are many treated units, then much of the variation in this measure will be between having a
treated unit within one or two hops. So we analyze a related statistic that uses, instead of the
distance to the nearest treated unit, an indicator for whether any of a unit’s non-focal
neighbors are treated. This statistic is the correlation between this indicator and the outcome,
both for focal units:

$$T_{\mathrm{htn}} = \frac{1}{N_F \cdot S_{Y^{\mathrm{obs}}} \cdot S_{TA}} \sum_{i: F_i = 1} \left( Y_i^{\mathrm{obs}} - \bar{Y}_F^{\mathrm{obs}} \right) \cdot \mathbf{1}\Big\{ \sum_{j=1}^N G_{ij} \cdot W_j \cdot (1 - F_j) > 0 \Big\},$$

where $S_{Y^{\mathrm{obs}}}$ and $S_{TA}$ are the sample standard deviation of the outcome for focal
units and the standard deviation for the indicator, for focal units, of having at least one
treated auxiliary neighbor. Like the edge-level contrast statistic, this statistic does not
adjust for estimated direct effects of the treatment.
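
A minimal sketch of this statistic follows, computed as a plain Pearson correlation among focal units so that the choice of normalization for the standard deviations does not matter. The function name and toy network are illustrative assumptions, and the code presumes that both the focal outcomes and the indicator vary (otherwise the correlation is undefined).

```python
from math import sqrt


def has_treated_neighbor_statistic(y, w, focal, G):
    """Correlation, among focal units, of the outcome with a treated-auxiliary-neighbor flag."""
    n = len(y)
    focal_units = [i for i in range(n) if focal[i] == 1]
    ys = [y[i] for i in focal_units]
    flags = [
        1.0 if any(G[i][j] and w[j] and not focal[j] for j in range(n)) else 0.0
        for i in focal_units
    ]
    m = len(focal_units)
    y_bar = sum(ys) / m
    f_bar = sum(flags) / m
    cov = sum((a - y_bar) * (b - f_bar) for a, b in zip(ys, flags))
    var_y = sum((a - y_bar) ** 2 for a in ys)
    var_f = sum((b - f_bar) ** 2 for b in flags)
    return cov / sqrt(var_y * var_f)


# Units 0, 1, 2 are focal; units 3, 4 are auxiliary. Edges: 0-3, 1-3, 2-4, 0-1.
G = [
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
focal = [1, 1, 1, 0, 0]
w = [0, 1, 0, 0, 1]
y = [1.0, 2.0, 3.0, 0.0, 0.0]

t_htn = has_treated_neighbor_statistic(y, w, focal, G)
```

Note that unit 0 has a treated neighbor (unit 1), but that neighbor is focal, so it does not switch on unit 0's indicator; only treated auxiliary neighbors count.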

5.4 Choosing the Focal Subpopulation for the Null Hypothesis of No Spillovers 

A key feature of our approach is that the researcher needs to choose a focal subpopulation. This
choice, in combination with the null hypothesis, determines the randomization distribution in
the artificial experiment. Although the p-values are valid irrespective of the choice of focal
subpopulation, this choice may affect the power of the testing procedure substantially.

Here we discuss some algorithms for choosing the subpopulation of focal units, where the goal
is to maximize the power of the test. In general the power will depend on a number of features of
the problem. First, it will depend on the alternative hypothesis, for example whether the
spillover effects are linear in the number or the proportion of treated neighbors. Second, the
power will depend on the choice of statistic. Third, the power will depend on the network
structure. Finding the focal subpopulation that optimizes power for a particular choice of
alternative and a particular test statistic is a difficult problem. Here we discuss some issues
and suggest general solutions that may have good power in a wide range of settings.

In the case of testing the null of no spillovers, there are three general principles that apply
irrespective of the specific alternative hypothesis and test statistic. First, because the
artificial experiment considers only changes in the treatment for auxiliary individuals, it is
important that there is a substantial number of auxiliary individuals. Second, because the
statistic depends only on outcomes for focal units, it is important that there is a substantial
number of focal units. Third, because the alternative hypothesis involves spillovers from
treated alters to focal egos, and because only changes in the treatment for auxiliary
individuals are considered, it is important that there are many edges between focal and
auxiliary individuals. These principles were helpful in the dyad case, where they suggested
selecting a single focal individual in each pair. Some settings may also have additional
constraints that guide the selection of focal units. For example, we might only observe the
outcome for a small fraction of the units even though the treatment is observed for all units
(e.g., Bond et al. (2012) only observe voting status for about 10% of their population).

5.4.1 Random Selection 

As a baseline method we randomly choose 50% of the population to be focal, with the remainder 
auxiliary, without regard to the network structure. 

5.4.2 Selection Based on ε-Nets

In the second approach to focal unit selection, we aim to select a large set of focal units that are
not adjacent to each other. In particular, we use a method for finding an ε-net (see, e.g., Gupta,
Krauthgamer and Lee, 2003), that is, a set of points that is both an ε-packing and an ε-covering,
with $\epsilon = 2$.* To define an ε-net on a graph, we let
$B_\epsilon(i) = \{j \in \mathbb{P} : d(i,j) < \epsilon\}$ be the set of all vertices fewer than ε hops
from vertex $i$.

Definition 4 (ε-net in a graph) An ε-net is a set of vertices $\mathbb{S} \subset \mathbb{P}$ such that:
(a) the vertices are mutually at distance at least ε from each other, $d(i,j) \ge \epsilon$ for all
distinct $i, j \in \mathbb{S}$; and (b) the union of all of their ε-balls covers all vertices,
$\mathbb{P} \subset \bigcup_{i \in \mathbb{S}} B_\epsilon(i)$.

Ugander, Karrer, Backstrom, and Kleinberg (2013) describe a greedy method for finding a
3-net, which can be generalized to find an ε-net for other values of ε. To find a 2-net, we do
the following. Starting with an empty set of focal units and an empty set of auxiliary units we
randomly select a seed for the ε-net. Given the new seed we assign it to the focal subpopulation,
and we assign all of its neighbors to the auxiliary subpopulation. If at that point all individuals
are assigned to either the focal or the auxiliary subpopulation we stop. If not, we randomly
draw another seed from the unassigned individuals to be assigned to the focal subpopulation and
assign all its neighbors to the auxiliary subpopulation. We continue randomly selecting new
seeds until all individuals are assigned to either the focal or auxiliary subpopulation. This
greedy algorithm leads to a set of focal units that are not neighbors.
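
The greedy 2-net construction described above can be sketched as follows. The function name and the use of a seeded `random.Random` are assumptions for illustration; walking through a random permutation and skipping already-assigned units is used here as an equivalent way of repeatedly drawing a random unassigned seed.

```python
import random


def greedy_two_net(G, seed=0):
    """Split units into focal seeds and their auxiliary neighbors.

    G -- symmetric 0/1 adjacency matrix as a list of lists (zero diagonal)
    """
    rng = random.Random(seed)
    n = len(G)
    order = list(range(n))
    rng.shuffle(order)
    focal, auxiliary = set(), set()
    for i in order:
        if i in focal or i in auxiliary:
            continue  # unit already assigned by an earlier seed
        focal.add(i)
        # All neighbors of a new seed become auxiliary.
        auxiliary.update(j for j in range(n) if G[i][j] == 1)
    return focal, auxiliary


# Toy example: a cycle on six vertices.
n = 6
G = [[0] * n for _ in range(n)]
for a in range(n):
    b = (a + 1) % n
    G[a][b] = G[b][a] = 1

focal, auxiliary = greedy_two_net(G, seed=1)
```

Every unit ends up in exactly one of the two sets, no two focal units are adjacent, and every auxiliary unit has at least one focal neighbor, which are the packing and covering properties for $\epsilon = 2$.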

*A 2-net is also called an independent set and the greedy algorithm we give here constructs a maximal
independent set. We describe this in terms of ε-nets because larger values of ε might be used when
testing other hypotheses about spillovers.


5.4.3 Maximizing the Number of Edge Comparisons 

In the third approach we choose the focal subpopulation by attempting to maximize the number
of focal-auxiliary edges,

$$N(\mathbf{F}, \mathbf{G}) = \sum_{i,j} F_i \cdot G_{ij} \cdot (1 - F_j),$$

leading to

$$\mathbf{F}^* = \arg\max_{\mathbf{F}} N(\mathbf{F}, \mathbf{G}).$$


This approach ignores the fact that the average over the edges may involve multiple edges with 
the same ego. This would not change the optimality if the number of focal-auxiliary edges were 
the same for all focal individuals, but if there is substantial variation in the number of such 
edges one might do better taking that into account. 

Solving this problem exactly is computationally demanding, so we approximate it by using
a greedy algorithm. We start by assigning all units to the auxiliary subpopulation, so that
there are no focal-auxiliary edges. We then calculate for each non-focal unit $i$ the number of
focal-auxiliary edges that would get added if unit $i$ were moved to the focal subpopulation.
Next, we add to the focal subpopulation the individual who brings the biggest gain. This
process continues until no additional focal unit would increase the number of focal-auxiliary
edges.

Suppose we have an initial focal subpopulation $\mathbf{F}$. For auxiliary individual $i$ consider
adding them to the focal subpopulation. That would change $N(\mathbf{F}, \mathbf{G})$ by the number of
auxiliary neighbors of $i$, $K_{A,i}$, minus the number of focal neighbors of $i$, $K_{F,i}$:

$$\Delta_{N,i} = K_{A,i} - K_{F,i}.$$

This puts a premium on selecting focal units with a larger number of edges. Because we
consider settings where it is the fraction of neighbors that are treated that matters for the
spillover effects, rather than the total number, we modify this criterion by dividing it by the
number of neighbors $K_i$, and select as an additional focal unit the one with the highest value of

$$\delta_{N,i} = \frac{K_{A,i} - K_{F,i}}{K_i}.$$

In regular graphs (i.e., where all units have the same number of neighbors) this change does
not matter, but it does in settings where the degree distribution has a positive variance.
Thus, we sequentially add to the set of focal units the unit $i$, among those currently not in the
focal subpopulation, who has the highest value for $\delta_{N,i}$, until there is no auxiliary unit
with a positive value for $\delta_{N,i}$.
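
The greedy procedure with the fractional criterion can be sketched as below. The function name is hypothetical, and two details are assumptions not fixed by the text: ties are broken in favor of the lowest-indexed unit, and isolated units (with no neighbors) are never made focal.

```python
def greedy_focal_by_edges(G):
    """Greedily add focal units while the fractional gain in focal-auxiliary edges is positive.

    G -- symmetric 0/1 adjacency matrix as a list of lists (zero diagonal)
    Returns a list of 0/1 focal indicators.
    """
    n = len(G)
    focal = [0] * n
    while True:
        best_unit, best_gain = None, 0.0
        for i in range(n):
            if focal[i]:
                continue
            degree = sum(G[i])
            if degree == 0:
                continue
            k_focal = sum(G[i][j] for j in range(n) if focal[j])
            k_aux = degree - k_focal
            gain = (k_aux - k_focal) / degree  # fractional criterion
            if gain > best_gain:
                best_unit, best_gain = i, gain
        if best_unit is None:
            break  # no auxiliary unit has a positive gain
        focal[best_unit] = 1
    return focal


# Toy example: the path 0 - 1 - 2 - 3.
G = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
focal = greedy_focal_by_edges(G)
n_focal_aux_edges = sum(
    focal[i] * G[i][j] * (1 - focal[j]) for i in range(4) for j in range(4)
)
```

On this path the procedure selects alternating units, so every edge of the graph connects a focal unit to an auxiliary unit.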

In settings where the network consists of dyads, both the ε-net approach and maximizing
the number of edge comparisons lead to the same result: in each dyad one randomly selected
vertex will be the focal unit and the other vertex in the dyad will be the auxiliary unit. In that
case the random selection of focal units without regard to network structure will be substantially
less powerful, because it allows for the possibility that both individuals in a dyad are focal or
that both are auxiliary.

There are more general connections between this method and the 2-net method. With the 
modified, fractional criterion this method first selects a 2-net as the focal units and then 
continues to add focal units. That is, this method allows using a larger set of focal units than 
would be selected by finding a 2-net. 

5.5 Exact P-values for Spillovers with Clustered Random Assignment 

Now suppose the randomization is more complex than the one considered in the previous section,
where we randomly selected $M$ units out of the population of $N$ to receive the treatment.
Of particular interest is the generalization with clustered randomization. In this case the
population is first partitioned into $K$ clusters, $\mathbb{P}_1, \ldots, \mathbb{P}_K$, with
$\mathbb{P}_k \subset \mathbb{P}$, $\mathbb{P}_k \cap \mathbb{P}_l = \emptyset$ if $k \neq l$, and
$\bigcup_{k=1}^K \mathbb{P}_k = \mathbb{P}$. This partitioning may depend on the network structure. In
fact, in graph cluster randomization, the partitioning is often chosen so as to heuristically
maximize the fraction of edges that are within clusters, subject to other constraints (e.g.,
cluster size), or other related quantities, such as modularity (Newman, 2006). See Eckles,
Karrer, and Ugander (2014) and Ugander, Karrer, Backstrom, and Kleinberg (2013). Let
$C_i \in \mathbb{C} = \{1, \ldots, K\}$ indicate the cluster that individual $i$ belongs to. In the
next step, $M$ of the $K$ clusters are assigned to the treatment group, implying all units in those
$M$ clusters will be exposed to the treatment, and the remaining units will be assigned to the
control group. More generally, we may consider an unrestricted distribution for the assignment
vector $\mathbf{W}$, specified by the function $p : \mathbb{W} \mapsto [0,1]$ for some set of
assignments $\mathbb{W}$, that is different from one that assigns equal probability to all
assignments with $M$ treated and $N - M$ control units.

For the original experiment the clustering does not change the fundamental approach. If 
we are interested in testing a sharp null hypothesis such as the null hypothesis of no effect 
of the treatment whatsoever, we can use exactly the same statistics. The only difference is 
that when we calculate the distribution of the statistic under the null, we now do so under 
the assignment mechanism defined by the clustering. Because many assignment vectors w 
that are possible under complete randomization are ruled out under cluster randomization, the 
clustering typically reduces the power of the tests. This issue is even more of a concern for 
testing null hypotheses regarding spillovers. We again select a focal subpopulation
$\mathbb{P}_F \subset \mathbb{P}$. For each individual calculate the set of assignments that do not
change the outcome for that individual under the null hypothesis, $\mathbb{V}(i, \mathbf{w}, H_0)$. The
restricted set of assignments is, as in the general case, the intersection of these sets over all
focal individuals:

$$\mathbb{W}_S = \bigcap_{i \in \mathbb{P}_F} \mathbb{V}(i, \mathbf{W}, H_0).$$

The distribution of the assignments in the artificial experiment is, as before, the conditional
probability given that $\mathbf{W} \in \mathbb{W}_S$:

$$p'(\mathbf{w}) = \frac{p(\mathbf{w})}{\sum_{\mathbf{w}' \in \mathbb{W}_S} p(\mathbf{w}')},$$

for $\mathbf{w} \in \mathbb{W}_S$, and zero elsewhere. The artificial experiment is now characterized by
the triple $(\mathbb{W}_S, \mathbb{P}_F, p'(\cdot))$.

For any statistic
$T : \mathbb{W}_S \times \mathbb{Y}^{N_F} \times \mathbb{X} \times \mathbb{G} \mapsto \mathbb{R}$, we can infer
its exact distribution under the null hypothesis of no spillovers, using the randomization
distribution induced by the clustered randomization. Thus we can use the same statistics as
before, e.g., the edge-level-contrast statistic or the score statistic. The change in the
distribution of the treatment affects the power of the tests, but does not fundamentally change
the approach.

To illustrate what practical issues the clustered randomization raises, consider the edge-level-contrast 
statistic $T_{elc}$. This statistic is equal to the difference between the average outcome for 
focal units over all edges between one focal unit and one auxiliary unit where the auxiliary unit 
is treated, and the average outcome for focal units over all such edges where the auxiliary unit is in 
the control group. Because treatments for units in the same clusters as focal units do not vary in 
$\mathbb{W}_R$ because of the cluster randomization, the power of the tests will be severely reduced if the 
clusters are constructed in such a way that there are few between-cluster edges. Although such 
clustering designs may be effective in estimating total causal effects that include both direct 
effects and spillover effects, e.g., Eckles, Karrer, and Ugander (2014) and Ugander, Karrer, 
Backstrom, and Kleinberg (2013), they may be less suited towards distinguishing between the 
two effects. 

Peer Effects 

Now consider the case where we are interested in the null hypothesis of no higher order peer 
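effects, Hypothesis 5. We focus again on the case with complete random assignment, although 
that is not critical. Define $H$ to be the matrix indicating neighbors of neighbors, so that 

$$H_{ij} = \begin{cases} 1 & \text{if } i \neq j \,\wedge\, G_{ij} = 0 \,\wedge\, \sum_{k=1}^{N} G_{ik} \cdot G_{jk} > 0, \\ 0 & \text{otherwise.} \end{cases}$$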


Again select a focal subpopulation $\mathbb{P}_F$. The change in the null hypothesis does not impose 
restrictions on the choice of the focal subpopulation, although the implications of this choice for 
the power are different compared to the case where the null hypothesis ruled out the presence 
of any spillovers. The difference with the previous null hypothesis of no spillovers is in the 
definition of the restricted set of assignments $\mathbb{W}_R$. Given this null hypothesis, for individual $i$, 
the level set $V(i, \mathbf{w}, H_0)$ now consists of the set of assignments $\mathbf{w}'$ such that the assignments 
are the same for $i$ and for all of $i$'s neighbors: 

$$V(i, \mathbf{w}, H_0) = \{\mathbf{w}' \in \mathbb{W} \mid w'_i = w_i \,\wedge\, (w'_j = w_j \text{ for all } j \text{ s.t. } G_{ij} = 1)\}.$$

Then, as before, the restricted set of assignments is the intersection over all focal units of these sets: 

$$\mathbb{W}_R = \bigcap_{i \in \mathbb{P}_F} V(i, \mathbf{W}, H_0).$$


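We can conceptualize this set in terms of a partition of the population into three subpopulations. 
Given the subpopulation of focal units $\mathbb{P}_F$, define the set of buffer units $\mathbb{P}_B$ who are not focal, 
but who have one or more neighbors who are focal: 

$$\mathbb{P}_B = \left\{ i \in \mathbb{P} \;\middle|\; F_i = 0 \,\wedge\, \sum_{j=1}^{N} G_{ij} \cdot F_j > 0 \right\},$$

and the set of auxiliary units $\mathbb{P}_A$ who are not focal, nor do they have neighbors who are focal: 

$$\mathbb{P}_A = \left\{ i \in \mathbb{P} \;\middle|\; F_i = 0 \,\wedge\, \sum_{j=1}^{N} G_{ij} \cdot F_j = 0 \right\}.$$

Then the restricted set of assignments keeps fixed the assignment for units who are not auxiliary, 
that is, for focal and buffer units: 

$$\mathbb{W}_R = \{\mathbf{w} \in \mathbb{W} \mid w_i = W_i \text{ if } i \in \mathbb{P}_F \cup \mathbb{P}_B\}.$$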

To visualize this, consider a very simple example with a population of three units, with 
the only edge between individuals 1 and 2, corresponding to the following adjacency matrix: 

$$G = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

Suppose we choose unit 1 to be the focal unit, $\mathbb{P}_F = \{1\}$. Then the set of neighbors of focal 
units, or the set of buffer units, is $\mathbb{P}_B = \{2\}$ and the set of auxiliary units is $\mathbb{P}_A = \{3\}$. Suppose 
the actual assignment is $\mathbf{W} = (0,0,0)$. Then 

$$\mathbb{W}_R = V(1, \mathbf{W}, H_0) = \{(0,0,0), (0,0,1)\},$$

allowing only the assignment for the auxiliary unit to vary. 

Now, the experiment we consider is that of randomly assigning $\mathbf{W}$ within the set $\mathbb{W}_R$. Under 
those assignments we know all the potential outcomes for focal individuals. The new assignment 
mechanism is, as before, the conditional assignment probability given the assignments for 
non-auxiliary units, $p'(\mathbf{w}) = \mathrm{pr}(\mathbf{W} = \mathbf{w} \mid \mathbf{W} \in \mathbb{W}_R)$, and the artificial experiment is 

$$\mathcal{E}' = (\mathbb{W}_R, \mathbb{P}_F, p'(\cdot)).$$
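The partition into focal, buffer, and auxiliary units, and the resulting restricted set of assignments, can be sketched in a few lines of Python. This is only an illustrative sketch for very small networks: the function names `partition_units` and `restricted_assignments` are hypothetical, units are indexed from 0 rather than 1, and the restricted set is built by brute-force enumeration of all binary assignment vectors (as in the three-unit example, where the number of treated units is not held fixed).

```python
from itertools import product


def partition_units(G, focal):
    """Split units into focal, buffer (non-focal with a focal neighbor), and auxiliary sets."""
    n = len(G)
    focal = set(focal)
    buffer = {i for i in range(n)
              if i not in focal and any(G[i][j] and j in focal for j in range(n))}
    auxiliary = set(range(n)) - focal - buffer
    return focal, buffer, auxiliary


def restricted_assignments(G, focal, w_obs):
    """Enumerate all binary assignments that agree with w_obs on focal and buffer units."""
    focal, buffer, _ = partition_units(G, focal)
    fixed = focal | buffer
    return [w for w in product((0, 1), repeat=len(G))
            if all(w[i] == w_obs[i] for i in fixed)]


# Three-unit example: the only edge is between units 1 and 2 (indices 0 and 1 here).
G = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
focal, buffer, auxiliary = partition_units(G, focal=[0])
W_R = restricted_assignments(G, focal=[0], w_obs=(0, 0, 0))
print(focal, buffer, auxiliary)  # {0} {1} {2}
print(W_R)                       # [(0, 0, 0), (0, 0, 1)]
```

For networks of realistic size the restricted set would not be enumerated; one would instead draw from the conditional assignment distribution by re-randomizing the treatments of auxiliary units only.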

6.1 Test Statistics 

Let us now consider test statistics for this setting. 

6.1.1 An Edge-Level-Contrast Statistic 

A natural approach to generalizing the edge-level-contrast statistic would be to focus on pairs 
of neighbors-of-neighbors, one focal and one auxiliary, and use as the test statistic the average 
outcome for focal units with treated auxiliary neighbors-of-neighbors minus the average outcome 
for focal units with control auxiliary neighbors-of-neighbors whose treatment varies in the 
restricted set. In order to define the latter condition, let $\mathbb{P}_A$ again be the set of auxiliary units, 
units who are not focal and who do not have any focal neighbors, and let $A_i$ be an indicator 
for the event that unit $i$ is an auxiliary unit. Then the edge-level-contrast statistic is: 

$$T^{ho}_{elc} = \frac{\sum_{i,j} F_i \cdot H_{ij} \cdot A_j \cdot W_j \cdot Y_i^{obs}}{\sum_{i,j} F_i \cdot H_{ij} \cdot A_j \cdot W_j} - \frac{\sum_{i,j} F_i \cdot H_{ij} \cdot A_j \cdot (1 - W_j) \cdot Y_i^{obs}}{\sum_{i,j} F_i \cdot H_{ij} \cdot A_j \cdot (1 - W_j)}. \qquad (6.10)$$
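As an illustration, the neighbors-of-neighbors matrix and the edge-level-contrast statistic for higher order spillovers can be computed directly from their definitions. The sketch below assumes a small undirected network stored as a 0/1 adjacency list of lists; the names `second_order_matrix` and `elc_higher_order` are hypothetical and units are indexed from 0.

```python
def second_order_matrix(G):
    """H[i][j] = 1 if i != j, i and j are not neighbors, and they share a neighbor."""
    n = len(G)
    return [[int(i != j and not G[i][j]
                 and any(G[i][k] and G[j][k] for k in range(n)))
             for j in range(n)] for i in range(n)]


def elc_higher_order(G, focal, w, y):
    """Contrast of focal outcomes over focal-auxiliary neighbor-of-neighbor pairs."""
    n = len(G)
    H = second_order_matrix(G)
    F = [int(i in focal) for i in range(n)]
    # Auxiliary units: not focal and without any focal neighbor.
    A = [int(not F[i] and not any(G[i][j] and F[j] for j in range(n)))
         for i in range(n)]
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for i in range(n):
        for j in range(n):
            if F[i] and H[i][j] and A[j]:
                sums[w[j]] += y[i]
                counts[w[j]] += 1
    if counts[0] == 0 or counts[1] == 0:
        return None  # the contrast is undefined without both kinds of pairs
    return sums[1] / counts[1] - sums[0] / counts[0]


# Path network 0-1-2-3-4-5 with focal units 0 and 5; units 2 and 3 are auxiliary.
G = [[0, 1, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]]
w = (0, 0, 1, 0, 0, 0)              # only auxiliary unit 2 is treated
y = (2.0, 0.0, 0.0, 0.0, 0.0, 0.5)  # outcomes; only focal outcomes enter
print(elc_higher_order(G, {0, 5}, w, y))  # 1.5
```

In this toy network the only contributing pairs are (0, 2), with a treated auxiliary unit, and (5, 3), with a control auxiliary unit, so the statistic is the difference of the two focal outcomes.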

As a practical matter, tests for higher order spillovers while allowing for first order spillovers 
are likely to have less power than tests for first order spillovers. A first reason is that generally 
one would expect higher order spillover effects to be small relative to direct effects and first order 
spillover effects. Second, in the procedure discussed here, we restrict the set of assignments $\mathbb{W}_R$ 
that is exploited in the calculation of the p-values by fixing not just the assignment for focal 
units, but also the assignment for all their neighbors. For a given set of focal units the test 
for first order spillover effects would have a much larger set of auxiliary units than the test 
for higher order spillover effects. To counter this, it may be important to restrict the size and 
characteristics of the set of focal units when analyzing tests for higher order spillover effects. 

6.1.2 A Score Statistic 

As an alternative to the edge-level-contrast statistic, we consider a score statistic based on a 
linear-in-means model of the type considered in Manski (1993, 2013), Goldsmith-Pinkham and 
Imbens (2013) and others, and previously here in Section 5.3.2. Under the null, we model 
the spillovers as additive and linear in the indicator for the own treatment and the fraction of 
neighbors treated: 

$$Y_i^{obs} = \alpha + \tau_{direct} \cdot W_i + \tau_{spill} \cdot \sum_{j=1}^{N} W_j \cdot \overline{G}_{ij} + \varepsilon_i,$$

where, as before, $\overline{G}_{ij} = G_{ij} / \sum_{m=1}^{N} G_{im}$, and zero if individual $i$ has no neighbors. 

Assuming the assignment to treatment is completely random, we can, given this model, 
estimate the parameters $\alpha$, $\tau_{direct}$ and $\tau_{spill}$ by least squares. We can then consider a more 
general model that allows second order effects of the treatment in addition to the first order 
effects captured by $\tau_{spill}$: 

$$Y_i^{obs} = \alpha + \tau_{direct} \cdot W_i + \tau_{spill} \cdot \sum_{j=1}^{N} W_j \cdot \overline{G}_{ij} + \tau_{second} \cdot \sum_{j=1}^{N} W_j \cdot \overline{H}_{ij} + \varepsilon_i,$$

where $\overline{H}_{ij} = H_{ij} / \sum_{m=1}^{N} H_{im}$ if $\sum_{m=1}^{N} H_{im} > 0$, and $\overline{H}_{ij} = 0$ if $\sum_{m=1}^{N} H_{im} = 0$. The score 
statistic for the second-order spillover effect $\tau_{second}$ is then proportional to the covariance between 
the estimated residual from the regression under the null and the fraction of second-order neighbors 
who are treated: 

$$T^{ho}_{score} = \mathrm{Cov}\left( Y_i^{obs} - \hat{\alpha} - \hat{\tau}_{direct} \cdot W_i - \hat{\tau}_{spill} \cdot \sum_{j=1}^{N} W_j \cdot \overline{G}_{ij}, \; \sum_{j=1}^{N} W_j \cdot \overline{H}_{ij} \;\middle|\; F_i = 1, \sum_{j=1}^{N} H_{ij} > 0 \right). \qquad (6.11)$$

This score statistic is very similar to that in the discussion of the null hypothesis of no spillovers, 
with two modifications. First, the outcome is now also adjusted for the first order spillover effect, 
by subtracting $\hat{\tau}_{spill} \cdot \sum_{j=1}^{N} W_j \cdot \overline{G}_{ij}$, and second, we look at the correlation of this adjusted 
outcome with the fraction of second order neighbors who are treated, instead of the fraction of 
direct neighbors who are treated. 

6.2 Choosing the Focal Subpopulation for the Null Hypothesis of No Higher 
Order Spillovers 

Given the structure of the artificial experiment for the null of no higher order spillovers, the 
key to statistical power is, in addition to the usual requirement for a sufficient number of focal 
units, the presence of auxiliary units (those who are not neighbors of any focal units) who 
are also neighbors of neighbors of focal units. Thus, we choose the focal subpopulation to, at 
least approximately, maximize the number of focal-auxiliary pairs where the auxiliary unit is a 
neighbor of a neighbor of the focal unit. 

Suppose we have a focal subpopulation $\mathbb{P}_F$, now with corresponding buffer and auxiliary 
subpopulations $\mathbb{P}_B$ and $\mathbb{P}_A$. Consider adding a currently non-focal (buffer or auxiliary) 
individual $i$ to the focal subpopulation, changing the focal subpopulation to $\tilde{\mathbb{P}}_F$ and the auxiliary 
subpopulation to $\tilde{\mathbb{P}}_A$. Then $\tilde{F}_j = F_j$ if $j \neq i$, and $\tilde{F}_i = 1$, $F_i = 0$. In addition, $\tilde{A}_i = 0$, and 
$\tilde{A}_j = A_j \cdot (1 - G_{ij})$ for $j \neq i$: neighbors of $i$ are removed from the set of auxiliary units. The 
number of new edges used in the edge-level-contrast statistic as a result of the change is the 
number of auxiliary units that are neighbors of neighbors of $i$: 

$$\sum_{j=1}^{N} \tilde{A}_j \cdot H_{ij} = \sum_{j=1}^{N} A_j \cdot (1 - G_{ij}) \cdot H_{ij} = \sum_{j=1}^{N} A_j \cdot H_{ij}.$$

The number of old edges no longer used in the statistic after adding unit $i$ to the focal 
subpopulation is determined by the set of individuals who used to be auxiliary but become buffer 
units as a result of being neighbors of $i$. This leads to the number of edges being dropped equal to 

$$\sum_{k=1}^{N} \sum_{j=1}^{N} F_k \cdot (A_j - \tilde{A}_j) \cdot H_{kj} + \sum_{k=1}^{N} F_k \cdot A_i \cdot H_{ki} = \sum_{k=1}^{N} \sum_{j=1}^{N} F_k \cdot A_j \cdot G_{ij} \cdot H_{kj} + \sum_{k=1}^{N} F_k \cdot A_i \cdot H_{ki}.$$

Thus, the addition of unit $i$ to the focal subpopulation would increase the number of comparisons by 

$$\Delta_{N,i} = \sum_{j=1}^{N} A_j \cdot H_{ij} - \sum_{k=1}^{N} \sum_{j=1}^{N} F_k \cdot A_j \cdot G_{ij} \cdot H_{kj} - \sum_{k=1}^{N} F_k \cdot A_i \cdot H_{ki}$$

$$= \sum_{j=1}^{N} (A_j - A_i \cdot F_j) \cdot H_{ij} - \sum_{k=1}^{N} \sum_{j=1}^{N} F_k \cdot A_j \cdot G_{ij} \cdot H_{kj}.$$

In cases where the alternative is proportional to the share of treated neighbors-of-neighbors, 
one may wish to optimize by choosing as the next focal unit the unit $i$ with the highest value of 

$$\delta_{N,i} = \frac{\sum_{j=1}^{N} (A_j - A_i \cdot F_j) \cdot H_{ij}}{\sum_{j=1}^{N} H_{ij}} - \sum_{k=1}^{N} F_k \cdot \frac{\sum_{j=1}^{N} A_j \cdot G_{ij} \cdot H_{kj}}{\sum_{j=1}^{N} H_{kj}},$$

with the stopping rule based on whether the maximum value of $\delta_{N,i}$ over all remaining non-focal 
units $i$ is positive or not. 

This algorithm will lead to a focal subpopulation with a large number of neighbors-of- 
neighbors who are auxiliary units. 
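The greedy selection can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the unweighted gain in comparisons (the number of added focal-auxiliary neighbor-of-neighbor pairs minus the number dropped) rather than the weighted criterion, it stops when no remaining unit has a positive gain, and the function names `second_order_matrix`, `auxiliary_indicators`, `delta`, and `greedy_focal` are hypothetical. Units are indexed from 0.

```python
def second_order_matrix(G):
    """H[i][j] = 1 if i != j, i and j are not neighbors, and they share a neighbor."""
    n = len(G)
    return [[int(i != j and not G[i][j]
                 and any(G[i][k] and G[j][k] for k in range(n)))
             for j in range(n)] for i in range(n)]


def auxiliary_indicators(G, F):
    """A[i] = 1 if unit i is not focal and has no focal neighbor."""
    n = len(G)
    return [int(not F[i] and not any(G[i][j] and F[j] for j in range(n)))
            for i in range(n)]


def delta(G, H, F, A, i):
    """Net change in the number of focal-auxiliary comparisons from making i focal."""
    n = len(G)
    gained = sum((A[j] - A[i] * F[j]) * H[i][j] for j in range(n))
    lost = sum(F[k] * A[j] * G[i][j] * H[k][j]
               for k in range(n) for j in range(n))
    return gained - lost


def greedy_focal(G):
    """Add focal units one at a time while the best available gain is positive."""
    n = len(G)
    H = second_order_matrix(G)
    F = [0] * n
    while True:
        A = auxiliary_indicators(G, F)
        candidates = [i for i in range(n) if not F[i]]
        if not candidates:
            break
        best = max(candidates, key=lambda i: delta(G, H, F, A, i))
        if delta(G, H, F, A, best) <= 0:
            break
        F[best] = 1
    return [i for i in range(n) if F[i]]


# Path network 0-1-2-3-4: the middle unit has two neighbors-of-neighbors.
path = [[0, 1, 0, 0, 0],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 1],
        [0, 0, 0, 1, 0]]
print(greedy_focal(path))  # [2]
```

On this five-unit path the middle unit is selected first, after which adding any other unit would remove at least as many comparisons as it adds, so the procedure stops with a single focal unit.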

work Specifications 

In this section we consider null hypotheses regarding competing specifications of the network. 
We have two specifications of the network, $G_1$ and $G_2$, with, for some pairs $(i, j)$, $G_{1,ij} \neq G_{2,ij}$. 
We test Hypothesis 6 that $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and for all pairs of assignment vectors 
$\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_j = w'_j$ for all units $j$ such that $G_{1,ij} = 1$. 

Given a set of focal units, the buffer subpopulation is now the subpopulation of units that 
are not focal, but that are neighbors with focal units under network $G_1$. The set of auxiliary 
units is the set of non-focal and non-buffer units. The level set for focal unit $i$ is 

$$V(i, \mathbf{w}, H_0) = \{\mathbf{w}' \in \mathbb{W} \mid w'_i = w_i \,\wedge\, (w'_j = w_j \text{ for all } j \text{ s.t. } G_{1,ij} = 1)\}.$$

Then, as before, the restricted set of assignments is the intersection over all focal units of these sets: 

$$\mathbb{W}_R = \bigcap_{i \in \mathbb{P}_F} V(i, \mathbf{W}, H_0).$$

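Next, we consider the choice of test statistics. First we consider an edge-level-contrast 
statistic. For all pairs of focal units and treated auxiliary units who are neighbors according to 
the second network, $G_2$, we average the outcome of the focal unit, and subtract the average 
over all pairs of focal units and control auxiliary units who are neighbors according to the 
second network: 

$$T_{elc} = \frac{\sum_{i,j} F_i \cdot G_{2,ij} \cdot A_j \cdot W_j \cdot Y_i^{obs}}{\sum_{i,j} F_i \cdot G_{2,ij} \cdot A_j \cdot W_j} - \frac{\sum_{i,j} F_i \cdot G_{2,ij} \cdot A_j \cdot (1 - W_j) \cdot Y_i^{obs}}{\sum_{i,j} F_i \cdot G_{2,ij} \cdot A_j \cdot (1 - W_j)}. \qquad (7.12)$$

For the score statistic we first estimate the effect of spillovers from the first network as in the 
previous section. For focal units we then calculate the covariance of the residual from this 
regression with the fraction of neighbors from the second network who are treated: 

$$T_{score} = \mathrm{Cov}\left( Y_i^{obs} - \hat{\alpha} - \hat{\tau}_{direct} \cdot W_i - \hat{\tau}_{spill} \cdot \frac{\sum_{j=1}^{N} W_j \cdot G_{1,ij}}{\sum_{j=1}^{N} G_{1,ij}}, \; \frac{\sum_{j=1}^{N} W_j \cdot G_{2,ij}}{\sum_{j=1}^{N} G_{2,ij}} \;\middle|\; \sum_{j=1}^{N} G_{2,ij} > 0, F_i = 1 \right). \qquad (7.13)$$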


To choose the focal subpopulation we again use a greedy algorithm, starting with the empty 
set as the subpopulation of focal units. We then sequentially add new focal units, one at a time, 
by choosing the currently non-focal unit whose inclusion in the focal subpopulation would add 
the most paths between focal and auxiliary units of length two, but not of length one. 

In this section we consider a null hypothesis for heterogeneity in the treatment effects. 
Hypothesis 7: $Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all $i$, and for all pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such 
that $w_i = w'_i$ and $\sum_{j=1}^{N} w_j \cdot G_{ij} = \sum_{j=1}^{N} w'_j \cdot G_{ij}$. What we are interested in here is testing whether it 
matters which of one's neighbors are treated, given the number of treated neighbors. It may 
be that neighbors with particular characteristics are more influential than others. This may 
correspond to neighbors with similar characteristics as the ego, or neighbors who have a more 
central place in the network, neighbors with whom the ego has more interactions, or neighbors 
with particularly high values for particular characteristics. 

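Given a focal subpopulation, the level set is 

$$V(i, \mathbf{w}, H_0) = \left\{ \mathbf{w}' \in \mathbb{W} \;\middle|\; w'_i = w_i \,\wedge\, \sum_{j=1}^{N} w'_j \cdot G_{ij} = \sum_{j=1}^{N} w_j \cdot G_{ij} \right\}.$$

As usual, the restricted set of assignments is the intersection over all focal units of these sets: 

$$\mathbb{W}_R = \bigcap_{i \in \mathbb{P}_F} V(i, \mathbf{W}, H_0).$$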

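To choose a test statistic we focus on the score approach. Under the null hypothesis we can 
estimate the direct and spillover effects by least squares, and calculate the residual 

$$\hat{\varepsilon}_i = Y_i^{obs} - \hat{\alpha} - \hat{\tau}_{direct} \cdot W_i - \hat{\tau}_{spill} \cdot \frac{\sum_{j=1}^{N} W_j \cdot G_{ij}}{\sum_{j=1}^{N} G_{ij}}.$$

There is a variety of alternative hypotheses we can consider. Here we focus on one where the 
effect of neighbor $j$ being treated on the outcome of individual $i$ is proportional to the degree 
of that unit (i.e., the number of neighbors $K_j$ that this neighbor $j$ has). This leads to 

$$T_{score} = \mathrm{Cov}\left( Y_i^{obs} - \hat{\alpha} - \hat{\tau}_{direct} \cdot W_i - \hat{\tau}_{spill} \cdot \frac{\sum_{j=1}^{N} W_j \cdot G_{ij}}{\sum_{j=1}^{N} G_{ij}}, \; \frac{\sum_{j=1}^{N} W_j \cdot K_j \cdot G_{ij}}{\sum_{j=1}^{N} K_j \cdot G_{ij}} \;\middle|\; \sum_{j=1}^{N} K_j \cdot G_{ij} > 0, F_i = 1 \right).$$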


To implement this test we also need to choose the focal subpopulation. In this case it is 
important for focal units to have variation in their friends' degree. Thus we need focal units with 
at least two neighbors. For each unit $i$ we calculate for all their non-focal neighbors $j$ how many 
non-focal neighbors this neighbor $j$ has: 

$$U_{ij} = \mathbf{1}_{G_{ij}=1} \cdot (1 - F_j) \cdot \sum_{l=1}^{N} (1 - F_l) \cdot G_{jl}.$$

Then we calculate the average and the standard deviation of this measure over all the neighbors 
of unit $i$: 

$$\overline{U}_i = \frac{\sum_{j: G_{ij}=1} (1 - F_j) \cdot U_{ij}}{\sum_{j: G_{ij}=1} (1 - F_j)}, \qquad S^2_{U,i} = \frac{\sum_{j: G_{ij}=1} (1 - F_j) \cdot (U_{ij} - \overline{U}_i)^2}{\sum_{j: G_{ij}=1} (1 - F_j)}.$$

Our approach now is to select, sequentially, focal units with high values for $S_{U,i}$. 

In this section, we carry out two sets of Monte Carlo simulations to assess the properties of the 
proposed procedures. In the first set, we focus on testing the null hypothesis of no spillovers in 
the context of general networks. In the second, we focus on the comparison of two networks, one 
sparser than the other, and test the null hypothesis that all spillovers are first-order spillovers 
in the sparser network. 

9.1 Monte Carlo Set Up I: Testing for the Presence of Spillovers 

The following components of the simulations are common to all designs in the first Monte 
Carlo set up. First consider the potential outcomes. Let $\mathbf{w}_0$ be the $N$-component vector with 
all elements equal to zero. Then, the baseline potential outcomes with no units exposed to the 
treatment are drawn from a Gaussian distribution: 

$$Y_i(\mathbf{w}_0) \sim \mathcal{N}(0,1), \quad \text{independent across all units.}$$

Let $\mathbf{w}_{(0,i)}$ be the $N$-component vector with all elements equal to zero other than the $i$th element, 
which is equal to one. We assume a constant additive direct (own) treatment effect: 

$$Y_i(\mathbf{w}_{(0,i)}) - Y_i(\mathbf{w}_0) = \tau_{direct},$$

for all $i = 1, \ldots, N$. Let $K_i$ be the number of peers for unit $i$ and let $K_{i0}$ and $K_{i1}$ be the 
number of control and treated peers. Then we assume a constant additive spillover effect that 
is proportional to the number of treated peers: 

$$Y_i(\mathbf{w}) = Y_i(\mathbf{w}_0) + w_i \cdot \tau_{direct} + \frac{K_{i1}}{K_i} \cdot \tau_{spill}.$$

If $\tau_{spill}$ is equal to zero the null hypothesis of no spillover effects holds. If $\tau_{spill} \neq 0$, the null 
hypothesis is violated. 

The assignment to treatment is completely random with a fixed number of treated and 
control individuals. In all simulations there are 599 individuals, 300 treated individuals and 
299 control individuals. 

The Monte Carlo designs vary along five dimensions. 

1. Network Structure: We consider two network structures. 

In the first network structure we take a network of friendships from one of the high 
schools represented in the Add Health data. For details on the design of this data set see We use a subset containing information 
on 599 students with at least one friend in the school. On average each student has 5.1 
friends, with a standard deviation of 3.1, and the number of friends ranging from 1 to 18. 
In these simulations we keep the network fixed across the simulations. 

In the second network structure we sample Watts-Strogatz (1998) small world networks 
with $K = 10$ and probability of rewiring $p_{rw} = 0.1$. The degree distribution thus has mean 
10 and standard deviation 1.37. The size of the network is the same as in the Add Health 
network, 599. 

2. Statistic: We consider three statistics. 

The first is the edge-level-contrast statistic $T_{elc}$, equal to the difference between the average ego 
outcome over all edges with focal egos and treated alters and the average ego outcome 
over all edges with focal egos and control alters, as given in (5.6). The second is the score 
statistic $T_{score}$ given in (5.8), motivated by a Manski-style linear-in-means model with 
endogenous peer effects. The third is the Aronow statistic $T_{htn}$, which is the difference in 
average outcomes for focal units with at least one treated neighbor and those with only 
control neighbors. 

3. Own Treatment Effect: We allow the own treatment effect $\tau_{direct}$ to take on the values 
0 and 4. 

4. Spillover Effect: We allow the spillover effect $\tau_{spill}$ to take on the values 0 and 0.4 to 
assess size properties of the test under the null hypothesis as well as power of the test 
under the alternative hypothesis. 

5. Location and Number of Focal Units: We compare three methods for choosing the 
focal units. In the first we randomly select 300 (approximately half) of the individuals to be 
focal. In the second we use the ε-net approach. In the Add Health network this approach 
leads to 213 (36%) focal individuals, and in the small world networks it leads on average 
to 98 (16%) focal individuals. In the third we maximize the number of edge comparisons, 
weighted by the number of neighbors, using the procedure described in Section 5.4.3. In 
the Add Health network this approach leads to 237 (40%) focal individuals, and in the 
small world networks it leads on average to 128 (21%) focal individuals. 

We approximate the p-value by drawing from the randomization distribution of the statistic 
under the null 1,000 times, and calculating the proportion of the draws where the absolute 
value of the statistic is larger than the absolute value of the statistic calculated on the actual 
data. We then report the fraction of replications, over 4,000 replications, where the p-value is 
less than 0.05. 
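The approximation of the p-value by draws from the randomization distribution can be sketched as below. This is an illustrative sketch, not the code used for the simulations: the function names `make_auxiliary_sampler` and `randomization_pvalue` are hypothetical, and the sampler assumes complete randomization in the original experiment, so that conditional on the assignments of the non-auxiliary units the treatments of the auxiliary units are a uniformly random permutation of their observed values.

```python
import random


def make_auxiliary_sampler(w_obs, auxiliary):
    """Return a function drawing assignments that vary only for auxiliary units."""
    aux = sorted(auxiliary)
    values = [w_obs[j] for j in aux]

    def draw(rng):
        shuffled = values[:]
        rng.shuffle(shuffled)  # keeps the number of treated auxiliary units fixed
        w = list(w_obs)
        for j, v in zip(aux, shuffled):
            w[j] = v
        return tuple(w)

    return draw


def randomization_pvalue(statistic, draw_assignment, w_obs, n_draws=1000, seed=0):
    """Share of draws whose absolute statistic exceeds the observed absolute statistic."""
    rng = random.Random(seed)
    t_obs = abs(statistic(w_obs))
    exceed = sum(abs(statistic(draw_assignment(rng))) > t_obs
                 for _ in range(n_draws))
    return exceed / n_draws


# Toy usage: units 2 and 3 are auxiliary; the statistic is a placeholder that
# depends only on the assignment (focal outcomes are fixed under the null).
w_obs = (1, 0, 1, 0)
draw = make_auxiliary_sampler(w_obs, auxiliary=[2, 3])
p_value = randomization_pvalue(lambda w: w[2] - w[3], draw, w_obs, n_draws=200)
print(p_value)
```

The `statistic` argument is any function of the assignment vector; in practice it would close over the observed focal outcomes and the network, for example the edge-level-contrast or score statistic.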

The results are presented in Table 1. We note a couple of the findings. First of all, when the 
null hypothesis is true, the tests all perform as expected, with the p-values less than 0.05 the 
appropriate number of times. When the null hypothesis is false we do see that the tests have 
substantial power. As discussed in the theoretical sections, the choice of focal units matters 
substantially for the power of the tests. Random selection of focal units performs quite poorly 
compared to more systematic ways of choosing the focal units. Both the method based on 
optimizing the number of focal-nonfocal friendships and the ε-net approach work substantially 
better. The choice of test statistic also matters a great deal: the score statistic, designed to 
be optimal for interesting alternatives, performs better than the edge-level-contrast statistic or 
the Aronow statistic. The structure of the network appears to matter less. Results for the Add 
Health network and the small world network are similar. 

9.2 Monte Carlo Set Up II: Testing for Sparsification 

In the second set of simulations we focus on tests for the presence of spillovers beyond the first 
order spillover of a sparser network. In the simulations we take the original Add Health network 
with 599 students as the baseline network. We create a sparser network by randomly cutting 
each edge in the Add Health network with probability q, where either q = 0.9 or q = 0.5. This 
leads to a network with average degree 0.43 (if we cut 90%) or 2.57 (if we cut 50%), compared 
to 5.15 in the original network. 

We randomly assign 300 of the students to the treatment. We then simulate outcome data 
according to the linear in means model: 

$$Y_i^{obs} = \tau_{direct} \cdot W_i + \tau_{spill} \cdot \overline{W}_{(i)} + \varepsilon_i,$$

where $\overline{W}_{(i)}$ is the fraction of neighbors who are treated, with weight $0 \leq \lambda \leq 1$ for edges that 
are only present in the second, less sparse, network: 

$$\overline{W}_{(i)} = \frac{\sum_{j=1}^{N} W_j \cdot \left( G_{1,ij} + \lambda \cdot (G_{2,ij} - G_{1,ij}) \right)}{\sum_{j=1}^{N} \left( G_{1,ij} + \lambda \cdot (G_{2,ij} - G_{1,ij}) \right)}.$$

If $\lambda = 0$ the sparsification is appropriate because the edges only in the second network do 
not matter. If $\lambda = 1$, the edges in the second network are just as important as those in the 
first network. We simulate the $\varepsilon_i$ as independent and identically distributed, with $\mathcal{N}(0,1)$ distributions. 

We focus on two statistics. The first statistic, $T_{score}$, is based on the covariance of the 
residual based on the model under the null and the share of treated second-network neighbors 
in (7.13). The specific statistic we focus on is the correlation between this residual and the 
fraction of treated neighbors for the focal individuals. The second statistic, $T_{elc}$, is the difference 
of two averages over all edges between focal and auxiliary individuals in (7.12). The focal 
subpopulation is selected using the greedy algorithm described in Section 7. 

We present results for a number of designs in Table 2. Again the tests work as expected 
when the null hypothesis is true. The power of the test is generally higher if the spillover effect 
is larger ($\tau_{spill} = 0.4$ rather than $\tau_{spill} = 0.1$), not surprisingly given that under the alternative 
the spillover effect for the second network neighbors is proportional to that for the first network 
neighbors. It is also higher if the sparsification of the network is more substantial ($q = 0.9$ 
rather than $q = 0.5$). Finally, as expected the score based statistic has more power than the 
edge-level-contrast statistic. 

In this paper we develop new methods for testing hypotheses with experimental data in settings 
with a single network. We focus on the calculation of Fisher-type, exact, finite sample, p- 
values. The complication is that the hypotheses we are interested in are not sharp, so that 
conventional methods for calculating exact p-values need to be modified. We show that by 
analyzing an artificial experiment, different from the one actually performed, one can calculate 
exact p-values for interesting hypotheses regarding spillovers, sparsification of networks, and 
peer effect heterogeneity. We illustrate approaches for selecting test statistics as well as the 
details of the artificial experiment to maximize statistical power. We illustrate the new methods 
by carrying out simulations. 


Bond, Fariss, Jones, Kramer, Marlow, Settle, and Fowler (2013), Bond et al. from hereon, are also interested 
in testing for spillovers (Hypothesis 2). They wish to use testing procedures that are robust to the network 
structure. We show here analytically that their procedures are not valid in general, and can lead to 
over-rejections of 0.05-level tests at rates as high as 0.20 because they ignore the variation arising from own treatment 

Bond et al. focus on the difference between the average of an ego's outcome over all edges where the alter is 
exposed, and the average over all edges where the alter is not exposed: 

$$T_B = \frac{\sum_{i,j} G_{ij} \cdot W_j \cdot Y_i^{obs}}{\sum_{i,j} G_{ij} \cdot W_j} - \frac{\sum_{i,j} G_{ij} \cdot (1 - W_j) \cdot Y_i^{obs}}{\sum_{i,j} G_{ij} \cdot (1 - W_j)}.$$

Under Hypothesis 2 the expected value of this statistic is zero, which makes it promising for testing this 
hypothesis. However, because of the network structure there may be dependence between the terms in each of these 
averages, and its variance is difficult to estimate for a general network structure. 

Bond et al. look at a randomization-based distribution for this statistic to test the null hypothesis of no 
spillovers. The distribution is obtained by re-assigning the treatment vector $\mathbf{W}$, assuming there is no effect of 
the treatment whatsoever, and deriving from there the quantiles of the $T_B$ distribution. This implicitly assumes 
for these calculations that there is no effect of the treatment whatsoever (Hypothesis 1), which is stronger than 
the no-spillover null hypothesis (Hypothesis 2) that they are interested in testing. The reason for this is that if 
one allows for direct effects of the treatment on the own outcomes, and only assumes no spillovers, one cannot 
infer the value of the statistic $T_B$ for alternative values of the treatment assignment vector: the no-spillover null 
hypothesis is not sharp. The concern is that using the randomization that is based on a stronger null hypothesis 
is not innocuous. Bond et al. justify the use of this method using simulations in which the stronger null is true. 

Here we show through analytic calculations for a particular example that p-values based on these calculations 
are not valid, even in large samples, let alone in finite samples, and that the deviations from nominal rejection 
probabilities can be substantial. In general, because their calculations ignore one source of variation in the 
distribution of the statistic, the p-values will be too small, leading to rejections of 0.05-level tests at rates as high 
as 0.20. 

We focus on an example with a particular network structure that allows us to simplify the large sample 
approximations. The population consists of $2 \cdot N$ units, partitioned into $N$ pairs. Out of these $2 \cdot N$ units, $N$ 
units are randomly selected to be exposed to the active treatment. We maintain the assumption that there are 
no spillovers. The potential outcomes are 

$$Y_i(0) = 0, \quad \text{and} \quad Y_i(1) = 1,$$

so that the direct treatment effect is equal to 1. The $N$ pairs can be partitioned into three sets: $M_{00}$ pairs with 
both units exposed to the control treatment, $M_{01}$ pairs with exactly one unit exposed to the control treatment 
and one unit exposed to the active treatment, and $M_{11}$ pairs with both units exposed to the active treatment. 
The numbers of pairs in each of these sets, $M_{00}$, $M_{01}$, and $M_{11}$, are random, but, because the total number of pairs is 
fixed at $N$, it follows that $M_{00} + M_{01} + M_{11} = N$, and because exactly $N$ units are exposed to the active 
treatment, it must be the case that $M_{00} = M_{11}$. Hence we can rewrite these numbers in terms of a scalar 
random integer: define $M = M_{11}$, so that $M_{00} = M$, and $M_{01} = N - 2 \cdot M$. The expected value of $M$ is 
$N \cdot (1/2) \cdot ((N - 1)/(2 \cdot N - 1)) \approx N/4$. However, the variance is not $N \cdot (1/4) \cdot (3/4)$ because of the fixed number 
of treated units. We can approximate the large sample distribution of $\sqrt{N} \cdot (M/N - 1/4)$ by looking at the joint 
distribution for $(\sqrt{N} \cdot (M_{00}/N - 1/4), \sqrt{N} \cdot (M_{01}/N - 1/2), \sqrt{N} \cdot (M_{11}/N - 1/4))$, based on independent random 
assignment to the treatment for each unit. This leads to 

$$\begin{pmatrix} \sqrt{N} \cdot (M_{00}/N - 1/4) \\ \sqrt{N} \cdot (M_{01}/N - 1/2) \\ \sqrt{N} \cdot (M_{11}/N - 1/4) \end{pmatrix} \xrightarrow{d} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 3/16 & -1/8 & -1/16 \\ -1/8 & 1/4 & -1/8 \\ -1/16 & -1/8 & 3/16 \end{pmatrix} \right).$$

This implies that 

$$\begin{pmatrix} \sqrt{N} \cdot (M_{11}/N - 1/4) \\ \sqrt{N} \cdot (2 \cdot M_{11}/N + M_{01}/N - 1) \end{pmatrix} \xrightarrow{d} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 3/16 & 1/4 \\ 1/4 & 1/2 \end{pmatrix} \right).$$

Now define $M = M_{11}$ and condition on $M_{01}/N + 2 \cdot M_{11}/N = 1$. Because the correlation between 
$\sqrt{N} \cdot (M_{11}/N - 1/4)$ and $\sqrt{N} \cdot (M_{01}/N + 2 \cdot M_{11}/N - 1)$ is $\rho = \sqrt{2/3}$, the conditional variance of $\sqrt{N} \cdot (M_{11}/N - 1/4)$ 
given $\sqrt{N} \cdot (M_{01}/N + 2 \cdot M_{11}/N - 1) = 0$ is $(3/16) \cdot (1 - \rho^2) = 1/16$, and 

$$\sqrt{N} \cdot (M/N - 1/4) \xrightarrow{d} \mathcal{N}(0, 1/16).$$

Now consider the statistic $T_B$. We calculate first the actual distribution of this statistic under the 
randomization distribution. Then we compare this to the distribution Bond et al. use for the calculation of p-values. 

There are $2 \cdot N$ edges. Out of these, $N$ have treated alters and $N$ have control alters. For the $N$ edges 
with treated alters, $2 \cdot M_{11} = 2 \cdot M$ have treated egos, and so have realized outcome equal to $Y_i(1) = 1$, and 
$M_{01} = N - 2 \cdot M$ have control egos, and so have realized outcomes equal to $Y_i(0) = 0$. The average realized 
outcome for egos with treated alters is therefore $2 \cdot M/N$. Similarly, for the $N$ edges with control alters, there 
are $2 \cdot M_{00} = 2 \cdot M$ edges with control egos and realized outcomes $Y_i(0) = 0$, and $M_{01} = N - 2 \cdot M$ edges with 
treated egos and thus $Y_i(1) = 1$, leading to an average realized outcome equal to $1 - 2 \cdot M/N$. Hence the value 
of the statistic is 

$$T_B = \frac{2 \cdot M}{N} - \left( 1 - \frac{2 \cdot M}{N} \right) = 4 \cdot \frac{M}{N} - 1.$$

The actual distribution of the normalized statistic, under random assignment, is 

$$\sqrt{N} \cdot T_B = \sqrt{N} \cdot 4 \cdot \left( \frac{M}{N} - \frac{1}{4} \right) \xrightarrow{d} \mathcal{N}(0, 1).$$

Now consider the distribution used by Bond et al. for the calculation of their p-values. They calculate 
the randomization distribution, assuming that there are no effects of the treatment whatsoever. Under this 
randomization distribution, there are always $N$ egos with treated alters, and $N$ egos with control alters. Out of 
the $2 \cdot N$ units there are $N$ with realized outcome equal to 1 and $N$ with realized outcome equal to 0, so that 
the total average outcome is exactly 1/2. Hence, if the average of the outcome for the egos with treated alters 
is equal to $\overline{Y}_t$, the average of the outcome for egos with control alters is equal to $\overline{Y}_c = 1 - \overline{Y}_t$. Therefore the 
difference in the average outcome for egos with treated alters and the average outcome for egos with control 
alters is equal to $2 \cdot \overline{Y}_t - 1$. To infer the randomization distribution used by Bond et al., we need to infer the 
distribution of $\overline{Y}_t$ under their randomization distribution. We can write $\overline{Y}_t$ as 

$$\overline{Y}_t = \frac{1}{N} \sum_{i=1}^{2N} W_i^p \cdot Y_i,$$

where $W_i^p$ is an indicator for unit $i$ having a treated alter. We are interested in this distribution under random 
assignment of the $W_i^p$, for fixed $\mathbf{Y}$. (It is the treating of $\mathbf{Y}$ as fixed that is not correct here: if we 
change the treatment of the alter for unit $i$ we may be changing the value of the outcome for unit $i$'s alter. Thus 
the $Y_i$ are stochastic, leading to additional variation in the test statistic that is not taken into account in the 
Bond et al. procedure.) Note that $\sum_{i=1}^{2N} Y_i = N$ and $\sum_{i=1}^{2N} W_i^p = N$. The treatments (and thus the peer treatments) are 
randomly assigned, with $\mathrm{pr}(W_i^p = 1) = 1/2$ and $\mathrm{pr}(W_i^p = 1 \mid W_j^p = 1) = (N - 1)/(2 \cdot N - 1)$. Define $D_i = 2 \cdot W_i^p - 1$ 
so that $W_i^p = (D_i + 1)/2$, and 

$$E[D_i] = 0, \qquad E[D_i^2] = 1, \qquad E[D_i \cdot D_j] = -\frac{1}{2 \cdot N - 1}, \quad j \neq i.$$

Then 

$$\overline{Y}_t = \frac{1}{N} \sum_{i=1}^{2N} \frac{D_i + 1}{2} \cdot Y_i = \frac{1}{2} + \frac{1}{2 \cdot N} \sum_{i=1}^{2N} D_i \cdot Y_i,$$

so that the expected value of $\overline{Y}_t$ is $1/2$, and its variance is 

$$V(\overline{Y}_t) = \frac{1}{4 \cdot N^2} \cdot E\left[ \left( \sum_{i=1}^{2N} D_i \cdot Y_i \right)^2 \right] = \frac{1}{4 \cdot N^2} \sum_{i=1}^{2N} Y_i^2 + \frac{1}{4 \cdot N^2} \sum_{i=1}^{2N} \sum_{j \neq i} Y_i \cdot Y_j \cdot E[D_i \cdot D_j]$$

$$= \frac{1}{4 \cdot N} - \frac{1}{4 \cdot N^2} \cdot N \cdot (N - 1) \cdot \frac{1}{2 \cdot N - 1} = \frac{1}{4 \cdot N} - \frac{1}{4 \cdot N} \cdot \frac{N - 1}{2 \cdot N - 1} \approx \frac{1}{8 \cdot N}.$$

Hence the variance of $\sqrt{N} \cdot \overline{Y}_t$ is equal to 1/8, and thus the variance of the Bond et al. randomization distribution is 
$4 \cdot N \cdot V(\overline{Y}_t)$, which is equal to 0.5. The actual distribution has variance equal to 1, which is twice as large. The 
implication is that for a two-sided test at the 0.05 level the rejection probability based on using the incorrect 
Bond et al. randomization distribution is 0.157. Bond et al. implicitly use the wrong variance of 0.5 for the test 
statistic, leading to 

$$\mathrm{pr}\left( \sqrt{N} \cdot |T_B| / \sqrt{0.5} > 1.96 \right) = \mathrm{pr}\left( \sqrt{N} \cdot |T_B| > 1.96/\sqrt{2} \right) = \mathrm{pr}\left( \sqrt{N} \cdot |T_B| > 1.386 \right) \approx 0.157.$$

We carried out a small simulation study to verify these analytic calculations. We use $N = 1000$ pairs, 10,000 
replications, and use 1,000 draws from the randomization distribution. We reject the null hypothesis if the Bond 
et al p-value is less than 0.05. This leads us to reject at a rate equal to 0.153, close to the theoretical rejection rate 
we calculated above which is equal to 0.157. (A 95% confidence interval for the rejection rate is (0.144,0.163)). 
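The two variances in the paired example can also be checked numerically. The sketch below is not the simulation reported above: it uses a smaller, hypothetical configuration (`n_pairs=100`, `n_reps=2000`) and compares the variance of $\sqrt{N} \cdot T_B$ when outcomes move with the assignment ($Y_i = W_i$, no spillovers) with the variance obtained when outcomes are held fixed and only the assignment is re-drawn. The function name `simulate_pairs` and the pairing of units $2k$ and $2k+1$ are assumptions made for illustration.

```python
import random


def simulate_pairs(n_pairs=100, n_reps=2000, seed=0):
    """Return (actual, fixed-outcome) variances of sqrt(N) * T_B in the paired design."""
    rng = random.Random(seed)
    n_units = 2 * n_pairs
    alter = [i ^ 1 for i in range(n_units)]  # units 2k and 2k+1 form a pair

    def draw_w():
        w = [1] * n_pairs + [0] * n_pairs  # exactly N of the 2N units are treated
        rng.shuffle(w)
        return w

    def t_b(w, y):
        # Every unit is an ego with exactly one edge, to its pair partner.
        treated = [y[i] for i in range(n_units) if w[alter[i]] == 1]
        control = [y[i] for i in range(n_units) if w[alter[i]] == 0]
        return sum(treated) / len(treated) - sum(control) / len(control)

    def scaled_variance(ts):
        mean = sum(ts) / len(ts)
        return n_pairs * sum((t - mean) ** 2 for t in ts) / len(ts)

    # Actual randomization distribution: outcomes change with the assignment.
    actual = []
    for _ in range(n_reps):
        w = draw_w()
        actual.append(t_b(w, w))  # Y_i(0) = 0 and Y_i(1) = 1, so Y_i = W_i

    # Distribution that treats outcomes as fixed while re-drawing the assignment.
    y_fixed = draw_w()
    fixed = [t_b(draw_w(), y_fixed) for _ in range(n_reps)]
    return scaled_variance(actual), scaled_variance(fixed)


v_actual, v_fixed = simulate_pairs()
print(v_actual, v_fixed)  # roughly 1 and roughly 0.5, up to simulation noise
```

Under these assumptions the first variance should be close to 1 and the second close to 0.5, in line with the analytic calculation that treating the outcomes as fixed understates the variance by a factor of two.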

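In terms of the potential outcomes the linear-in-means model in (5.9) corresponds to 

$$\mathbf{Y}(\mathbf{w}) = \alpha_0 \cdot (I - \tau_{endog} \cdot \overline{G})^{-1} \iota_N + \tau_{direct} \cdot (I - \tau_{endog} \cdot \overline{G})^{-1} \mathbf{w} + (I - \tau_{endog} \cdot \overline{G})^{-1} \varepsilon. \qquad (B.1)$$

The expected value of the observed outcomes given the assignment is, given the random assignment, 

$$E[\mathbf{Y}^{obs} \mid \mathbf{W} = \mathbf{w}] = E[\mathbf{Y}(\mathbf{w})] = \alpha_0 \cdot (I - \tau_{endog} \cdot \overline{G})^{-1} \iota_N + \tau_{direct} \cdot (I - \tau_{endog} \cdot \overline{G})^{-1} \mathbf{w}. \qquad (B.2)$$

Under the null hypothesis that $\tau_{endog} = 0$, the least squares estimates for the remaining parameters based on 
outcomes for focal units are 

$$\hat{\alpha}_0 = \overline{Y}^{obs}_{F,0}, \quad \text{and} \quad \hat{\tau}_{direct} = \overline{Y}^{obs}_{F,1} - \overline{Y}^{obs}_{F,0},$$

where, for $w = 0, 1$, $\overline{Y}^{obs}_{F,w}$ is the average outcome for focal units with $W_i = w$, 

$$\overline{Y}^{obs}_{F,w} = \frac{1}{N_{F,w}} \sum_{i: F_i = 1, W_i = w} Y_i^{obs},$$

and $N_{F,w}$ is the number of focal units with $W_i = w$. Hence the residual under the null is 

$$\hat{\varepsilon}_i^{null} = Y_i^{obs} - \hat{\alpha}_0 - W_i \cdot \hat{\tau}_{direct}.$$

Under normality of the outcome the score for $\tau_{endog} = 0$ is proportional to the covariance of the residual under 
the null and the derivative of the expectation in (B.2) with respect to $\tau_{endog}$, evaluated at $\tau_{endog} = 0$. The 
derivative of the expectation at $\tau_{endog} = 0$ is 

$$\frac{\partial}{\partial \tau_{endog}} E[\mathbf{Y}^{obs} \mid \mathbf{W}] \Big|_{\tau_{endog} = 0} = \alpha_0 \cdot \overline{G} \iota_N + \tau_{direct} \cdot \overline{G} \mathbf{W} = \alpha_0 \cdot \overline{G} (\iota_N - \mathbf{W}) + (\tau_{direct} + \alpha_0) \cdot \overline{G} \mathbf{W}.$$

Substituting $\overline{Y}^{obs}_{F,0}$ for $\alpha_0$ and $\overline{Y}^{obs}_{F,1} - \overline{Y}^{obs}_{F,0}$ for $\tau_{direct}$ suggests that a natural test statistic would be the covariance 
of the residual under the null and $\overline{Y}^{obs}_{F,0} \cdot \overline{G} (\iota_N - \mathbf{W}) + \overline{Y}^{obs}_{F,1} \cdot \overline{G} \mathbf{W}$. This leads to the following average score: 

$$\frac{1}{N_F} \sum_{i: F_i = 1} \left( Y_i^{obs} - \overline{Y}^{obs}_{F,0} - W_i \cdot (\overline{Y}^{obs}_{F,1} - \overline{Y}^{obs}_{F,0}) \right) \cdot \sum_{j=1}^{N} \overline{G}_{ij} \cdot \left( (1 - W_j) \cdot \overline{Y}^{obs}_{F,0} + W_j \cdot \overline{Y}^{obs}_{F,1} \right).$$

Because $\sum_{j=1}^{N} \overline{G}_{ij} = 1$, in combination with the fact that the residuals average to zero, it follows that the score 
statistic is proportional to the covariance between the residual under the null and $\sum_{j=1}^{N} \overline{G}_{ij} \cdot W_j$, which is the 
fraction of treated neighbors, leading to the score statistic 

$$T_{score} = \mathrm{Cov}\left( Y_i^{obs} - \overline{Y}^{obs}_{F,0} - W_i \cdot (\overline{Y}^{obs}_{F,1} - \overline{Y}^{obs}_{F,0}), \; \sum_{j=1}^{N} W_j \cdot \overline{G}_{ij} \;\middle|\; \sum_{j=1}^{N} G_{ij} > 0, F_i = 1 \right)$$

$$= \mathrm{Cov}\left( Y_i^{obs} - \hat{\alpha} - \hat{\tau}_{direct} \cdot W_i, \; \sum_{j=1}^{N} W_j \cdot \overline{G}_{ij} \;\middle|\; \sum_{j=1}^{N} G_{ij} > 0, F_i = 1 \right),$$

which is the expression in (5.8). 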
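To make the final expression concrete, the score statistic for the null of no spillovers can be computed as the covariance, over focal units with at least one neighbor, between the residual under the null and the fraction of treated neighbors. The sketch below is an illustration under assumptions: the function name `score_statistic` is hypothetical, the covariance is normalized by the number of contributing focal units, and units are indexed from 0.

```python
def score_statistic(G, focal, w, y):
    """Covariance of null residuals with the fraction of treated neighbors (focal units)."""
    focal = sorted(focal)
    y1 = [y[i] for i in focal if w[i] == 1]
    y0 = [y[i] for i in focal if w[i] == 0]
    if not y1 or not y0:
        raise ValueError("need both treated and control focal units")
    mean1 = sum(y1) / len(y1)  # average outcome for treated focal units
    mean0 = sum(y0) / len(y0)  # average outcome for control focal units

    residuals, fractions = [], []
    for i in focal:
        degree = sum(G[i])
        if degree == 0:
            continue  # condition on focal units with at least one neighbor
        residuals.append(y[i] - mean0 - w[i] * (mean1 - mean0))
        fractions.append(sum(G[i][j] * w[j] for j in range(len(G))) / degree)

    m = len(residuals)
    r_bar = sum(residuals) / m
    f_bar = sum(fractions) / m
    return sum((r - r_bar) * (f - f_bar)
               for r, f in zip(residuals, fractions)) / m


# Units 0-3 are focal; units 4 and 5 are their (non-focal) neighbors.
G = [[0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 1],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 0]]
w = (1, 0, 1, 0, 1, 0)
# Outcomes with direct effect 2 and spillover 1 times the fraction of treated neighbors.
y = (3.0, 1.0, 2.0, 0.0, 0.0, 0.0)
print(score_statistic(G, {0, 1, 2, 3}, w, y))  # 0.25
```

When the outcomes contain only the direct effect (for example `y = (2.0, 0.0, 2.0, 0.0, 0.0, 0.0)`), the residuals are all zero and so is the statistic.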



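Table 1: Rejection Rates of Null Hypothesis of No Spillovers 

[Table entries not recoverable from the source. Columns: Focal Vertex Selection (Random, ε-net, $\delta_{N,i}$). Row blocks: Add Health network; Small World network ($K = 10$, $p_{rw} = 0.1$). Rows within each block are indexed by test statistic, including $T_{score}$.] 
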
Table 2: Rejection Rates of Null Hypothesis of No Spillovers Beyond the First 
Order Spillovers from the Sparsified Network, AddHealth data, 10,000 Replications 

[Table entries not recoverable from the source. Columns: Proportion of Links Dropped ($q = 0.9$, $q = 0.5$).] 